<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:blogger='http://schemas.google.com/blogger/2008' xmlns:georss='http://www.georss.org/georss' xmlns:gd="http://schemas.google.com/g/2005" xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-2315822260943695633</id><updated>2026-05-06T10:03:54.813-04:00</updated><category term="Proc SQL"/><category term="Macro"/><category term="FDA"/><category term="ODS escapechar"/><category term="SDTM"/><category term="call symput"/><category term="input"/><category term="Do loops"/><category term="Proc Datasets"/><category term="Proc Export"/><category term="Proc Import"/><category term="Proc Sort"/><category term="SAS Programming Tips"/><category term="SASHELP views"/><category term="SDTM Validation"/><category term="missover"/><category term=".CSV"/><category term=".XPT to SAS"/><category term="ADaM"/><category term="Arrays"/><category term="CDISC"/><category term="CDISC Audio Seminars"/><category term="CRF&#39;s"/><category term="Custom Domains"/><category term="DLM"/><category term="DM Commands"/><category term="DSD"/><category term="Data _null_"/><category term="Data step"/><category term="Datastep"/><category term="Define.XML Define.PDF SAS Metadata Case Report Tabulations(CRT&#39;s) FDA"/><category term="Display Manager commands"/><category term="Do-loop"/><category term="Emailing with SAS"/><category term="Excel files"/><category term="Firstobs="/><category term="Infile"/><category term="Label"/><category term="MLOGIC"/><category term="MPRINT"/><category term="Macros"/><category term="Missing Values"/><category term="NOBS"/><category term="NODUP"/><category term="NODUPKEY"/><category term="New SAS Functions"/><category term="ODM SDTM CDISC ADaM ISO HL7"/><category term="Page X of Y"/><category term="Proc CDISC"/><category 
term="Proc compare"/><category term="Propcase Function in SAS"/><category term="RFSTDTC"/><category term="Replace"/><category term="Resolving Macro Variables"/><category term="SAS 9.2"/><category term="SAS Arrays"/><category term="SAS Graph"/><category term="SAS Interview Questions"/><category term="SAS Procedures"/><category term="SYMBOLGEN"/><category term="Sending Email using SAS"/><category term="VCOLUMN"/><category term="VTABLE"/><category term="Validvarname=Upcase"/><category term="call symputx"/><category term="compress"/><category term="delimiters"/><category term="findc"/><category term="intck"/><category term="missing"/><category term="protocol"/><category term="put"/><category term="substr"/><category term="wlatin1"/><category term="$as extension"/><category term="$upcasew. format"/><category term="%eval and %sysevalf macro functions"/><category term="%global"/><category term="%include"/><category term="%local"/><category term="%sysfunc"/><category term="--DY"/><category term="--ENDY"/><category term="--OBJ"/><category term="--OCCUR"/><category term="--PRESP"/><category term="--STDY"/><category term=".A"/><category term=".CSV files"/><category term=".Z"/><category term=".xpt creation"/><category term="1.3 and 1.7"/><category term="20 SAS Macros Tips in 30 Minutes"/><category term="3.1.2"/><category term="3.1.3"/><category term="3.2. 3.3 and 3.4"/><category term="? or ?? 
format Modifiers"/><category term="ADaM Datasets VS SDTM Datasets"/><category term="ALT"/><category term="ALTER TABLE"/><category term="ANY"/><category term="ANYDTDTMw format"/><category term="ASCII files"/><category term="ASCIIANY"/><category term="Accurately Calculating Age with Only One Line of Code"/><category term="Actual Number of Subjects"/><category term="Adaptive Design"/><category term="Adding extra columns or Variables in the existing table with PROC SQL"/><category term="Advanced SAS programming Techniques"/><category term="Advantages"/><category term="Annotated CRF&#39;s"/><category term="Annotation"/><category term="Application Data"/><category term="Arrays to detect the missing values. Definition of Missing Values"/><category term="Auto Save SAS Code"/><category term="Automate QC Checks"/><category term="Automate data processing"/><category term="Automation"/><category term="Axis control"/><category term="Bar charts"/><category term="Base SAS Certification Assignments"/><category term="Base SAS Certification Summary Functions"/><category term="Base SAS Interview"/><category term="Base SAS Tips"/><category term="Behavioural type Interview"/><category term="Best Practices for adding Columns"/><category term="Better programming practice"/><category term="Blackjack"/><category term="Blinded study"/><category term="CAT"/><category term="CATS"/><category term="CATT"/><category term="CATX"/><category term="CDISC SDTM/ADaM Pilot Project"/><category term="CDISC Standards"/><category term="CMISS()"/><category term="COMMA10."/><category term="COUNT()"/><category term="CRO&#39;s"/><category term="CTRL"/><category term="Calculating group totals and the counts within each group"/><category term="Call Execute"/><category term="Carriage Returns"/><category term="Case Report Tabulations"/><category term="Changing Careers to Become a SAS Programmer in the Biotech"/><category term="Character Encoding"/><category term="Charcter to numeric conversion"/><category 
term="Clinical Data Management"/><category term="Clinical Event"/><category term="Clinical SAS Interview"/><category term="Clinical SAS Programming Interview Questions and Answers"/><category term="Clinical Trial Terminology"/><category term="Clinical Trials"/><category term="Clinplus Project"/><category term="Coalsece"/><category term="Code Review"/><category term="Code readability"/><category term="Combining/ Merging data in SAS/ SQL"/><category term="Comments Tab in Define.xml"/><category term="Compare 2 approaches"/><category term="Comparing EX and EC domains"/><category term="Comparing SAS steps and Proc SQL"/><category term="Complaince Checks"/><category term="Compltetypes"/><category term="Concatenation Functions"/><category term="Controlling the Graph Axis"/><category term="Convert SAS to SPSS"/><category term="Convert SDTM data to ADaM data"/><category term="Convert a character variable - numeric variable with PROC SQL"/><category term="Creating SAS Tranport files of datasets"/><category term="Creating macro variables"/><category term="Creating missing Values"/><category term="Cynthia Johnson"/><category term="DATDIF"/><category term="DBMS"/><category term="DD-MM-YYYY"/><category term="DDE"/><category term="DEXPORT"/><category term="DIMPORT"/><category term="Data Listings"/><category term="Data Tabulations"/><category term="Datalines"/><category term="Dataset Structure Documentation in Define.xml"/><category term="Dataset Structure and Keys Column in Define.xml"/><category term="Date/Time of First Study Treatment"/><category term="Datetime functions"/><category term="David Franklin&#39;s SAS Tips"/><category term="David Ghan"/><category term="Debugging"/><category term="Debugging SAS Code"/><category term="Define"/><category term="Define.XML/PDF"/><category term="Define.xml"/><category term="Define.xml Review Checklist"/><category term="Delete observations"/><category term="Deleting Formats and informats"/><category term="Derive Study Start 
Date"/><category term="Dictionary Tables"/><category term="Dictionary.Tables"/><category term="Dictionary.colums"/><category term="Dinctionary Tables"/><category term="Distinct"/><category term="Do Until"/><category term="Do over Zero"/><category term="Do we need to create ADaM dataset for each SDTM dataset?"/><category term="Do you want to know if your datasets is sorted or not"/><category term="Double Blind Study"/><category term="Double Programming"/><category term="Dubugging"/><category term="Dummy Dataset"/><category term="Dynamic Data Exchange"/><category term="Dynamic Macro"/><category term="EBCDICANY"/><category term="ENCODING=Dataset Option"/><category term="ENCODING=option"/><category term="EPOCH Assignment for Pre-Consent Data"/><category term="ERROR 29-185: Width Specified for format ----  is invalid; Invalid Width; ERROR 29-185;"/><category term="ERROR/WARNING/UNINITIALIZED messages"/><category term="ERROR: Some character data was lost during transcoding in the dataset"/><category term="EX and EC (Exposure as Collected) Domains"/><category term="EX vs EC domains"/><category term="EXECFILENAME"/><category term="EXECFILEPATH"/><category term="EXIST"/><category term="Efficacy Datasets"/><category term="Efficiency Tips"/><category term="Efficient Directory Management in SAS: A Custom Macro"/><category term="Electronic Data Capture( EDC)"/><category term="Enhanced Editor"/><category term="Equivalent of NODUPKEY in PROC SQL"/><category term="Everything You Ever Wanted to Know"/><category term="Expandtabs"/><category term="Export unsuccessful"/><category term="FAOBJ"/><category term="FDA vs PMDA Submissions"/><category term="February 29"/><category term="Filename"/><category term="Finding Unique EPOCH values from SDTM datasets in entire library."/><category term="Findings About"/><category term="Floor"/><category term="Format Details"/><category term="Format that Doesn&#39;t Remove Leading Zero"/><category term="Format that inserts a slash"/><category 
term="GET SAS"/><category term="Get name/Location of file"/><category term="Getting Data into SAS"/><category term="Global"/><category term="Guidelines for Coding of SAS® Programs"/><category term="HANDLING SPECIAL EMBEDDED CHARACTERS"/><category term="HASH and Double HASH"/><category term="HL7"/><category term="Helpful documents for Proc SQL:"/><category term="How to extract year information from three formats of dates in a dataset?"/><category term="How to know with what variables the dataset got sorted without looking at the SAS code"/><category term="How to scan more than 20 records to determine variable attributes in EFI"/><category term="IFC"/><category term="IFN"/><category term="IMPLEMENTATION OF CDISC STANDARDS"/><category term="IND application"/><category term="INNER JOIN"/><category term="IRB"/><category term="ISE benefit/risk profile"/><category term="ISS"/><category term="Identify blank columns"/><category term="Implementation"/><category term="Import strange datetime format"/><category term="Importance of Warnings and Notes messages from SAS log"/><category term="Importing DBF files to SAS Datasets"/><category term="Importing excel files into SAS"/><category term="Imputed Dates"/><category term="Integrated Summary of Efficacy"/><category term="Integrated Summary of Safety"/><category term="Interleaving"/><category term="Introduction to SAS Informats and Formats"/><category term="JANUS"/><category term="JOB prospects for Clinical trial coordinators and SAS Programmers in INDIA"/><category term="Japanese Text"/><category term="KD"/><category term="KILL"/><category term="Kent Reeve"/><category term="Key Principles for Analysis Datasets Creation"/><category term="Keyboard Shortcuts"/><category term="LAG Function"/><category term="LB Unit Conversion in SDTM"/><category term="LB domain expansion"/><category term="LEFT"/><category term="LLT"/><category term="Last Fileref"/><category term="Lead Function"/><category term="Learn SAS in 6 weeks"/><category 
term="Legends"/><category term="Letter"/><category term="Line Feeds"/><category term="Line Plot"/><category term="Lost SAS Code"/><category term="MACROGEN and MFILE"/><category term="MERGE"/><category term="MFILE"/><category term="MINDELIMITER"/><category term="MINOPERATOR"/><category term="MISSING()"/><category term="MM-YYYY"/><category term="MMDDYY+"/><category term="MMDDYYB"/><category term="MMDDYYC"/><category term="MMDDYYD"/><category term="MMDDYYN"/><category term="MMDDYYP"/><category term="MMDDYYS"/><category term="MODIFY statement"/><category term="Macro Calls"/><category term="Macro Debugging Options"/><category term="Macro IN operator"/><category term="Macro creation"/><category term="Macro for Sorting the Datasets"/><category term="Macro for merging Datsets with same prefix name"/><category term="Macro to replace missing values"/><category term="Manipulating the Data"/><category term="Many to Many Merge"/><category term="Many-to-Many Merge without  Proc SQl"/><category term="Mapping"/><category term="Mastering Directory Management in SAS: A Guide to Copying Directories"/><category term="MedDRA Coding"/><category term="Memtype"/><category term="Merge macro"/><category term="Merge multiple datasets"/><category term="Merge statements"/><category term="Merging Data Seven Different Ways"/><category term="Missing Function"/><category term="Missing=0"/><category term="Multiple Graphs on One Page Using SAS/GRAPH® V9"/><category term="Multiple ampersands"/><category term="N and NMISS"/><category term="NDA"/><category term="NDA application"/><category term="NMISS()"/><category term="NODUPRECS"/><category term="NOTE: The SAS System stopped processing this step because of errors."/><category term="NOUNIQUEKEYS"/><category term="NVAR"/><category term="Name"/><category term="New SAS 9 Format"/><category term="Nolist"/><category term="Non Printable Hex characters"/><category term="Non Printable characters in SAS datasets. 
Remove or Drop non printable characters"/><category term="Non-Standard SDTM domains"/><category term="Not sure if the variable is character or numeric?"/><category term="Number of Observations"/><category term="Numeric to character conversion"/><category term="Numeric to character or character to numeric using Proc SQL"/><category term="ODM"/><category term="ODS  tip  report writing  cell width"/><category term="ODS HTML RTF LISTING Output Delivery System"/><category term="Official List"/><category term="One liner"/><category term="Open"/><category term="Open CDISC"/><category term="Open Label study"/><category term="Ophthalmic Examinations Domain"/><category term="Opposite to LAG function"/><category term="Options"/><category term="Order of Missing Values"/><category term="Out="/><category term="Overview"/><category term="P-value"/><category term="P21 Enterprise Issues"/><category term="PROC IMPORT with a Twist"/><category term="PROC SQl ( Left join"/><category term="PRXMATCH"/><category term="PT"/><category term="PUTLOG"/><category term="PUTNAMES=NO; %ds2CSV; ODS CSV; CSV tagset; table_headers=’NO’; Column Headers;"/><category term="Path"/><category term="Pathname"/><category term="Patient Profiles"/><category term="Pattern Matching"/><category term="Perl Regular Expressions"/><category term="Permanant and Temporary Formats"/><category term="Phase 2 and Phase3"/><category term="Phase 4"/><category term="Phase1"/><category term="Phase1/2"/><category term="Phase1b"/><category term="Picture Format"/><category term="Pinnacle 21"/><category term="Poger"/><category term="Power Up Your Data Cleaning with the SAS COMPRESS Function"/><category term="Prefix"/><category term="Preloadfmt option"/><category term="Problems and Soutions"/><category term="Proc Catalog"/><category term="Proc Copy"/><category term="Proc Freq"/><category term="Proc GCHART"/><category term="Proc Gplot"/><category term="Proc Lifetest: Survival Analysis Using SAS"/><category term="Proc 
Means"/><category term="Proc Printto"/><category term="Proc Report"/><category term="Proc SQ"/><category term="Proc SQL GROUP BY clause"/><category term="Proc SQL Tips"/><category term="Proc STDIZE"/><category term="Proc Surveyselect"/><category term="Proc Transpose"/><category term="QC"/><category term="QUIT"/><category term="QuoteLenMax"/><category term="RANUNI"/><category term="RELREC"/><category term="REPONLY"/><category term="RFXSTDTC"/><category term="RUN"/><category term="Random Number"/><category term="Random records"/><category term="Random sampling"/><category term="Reading Data into SAS"/><category term="Reference Start Date"/><category term="Remove"/><category term="Remove Characters form string"/><category term="Resource Tips"/><category term="Retain"/><category term="Retain Statement in SAS"/><category term="Retaining Values without Lag Functions"/><category term="Right Join and Full Joins) and SAS GRAPH and Output deliver System (ODS)"/><category term="Routing the Output"/><category term="SAS"/><category term="SAS 9.3 and later versions"/><category term="SAS Array"/><category term="SAS Datasteps"/><category term="SAS Date"/><category term="SAS Documentation"/><category term="SAS Exercises"/><category term="SAS Import and Export"/><category term="SAS Infile Options"/><category term="SAS Instructor Tips 1 and 2"/><category term="SAS Interview Questions/Answers:Clinical trials"/><category term="SAS Interview Skills and Process"/><category term="SAS Missing functions"/><category term="SAS ODS LISTING CLOSE"/><category term="SAS OnlineTutor®: Advanced SAS®"/><category term="SAS Proficiency Test"/><category term="SAS Programmer"/><category term="SAS Programmer responsibities"/><category term="SAS Programming Errors"/><category term="SAS Programming in Pharmaceutical Industry"/><category term="SAS Programs"/><category term="SAS Quiz"/><category term="SAS Tip 1: Use Less code"/><category term="SAS Tips Archive"/><category term="SAS Tips and 
Tricks"/><category term="SAS Toolbar Settings"/><category term="SAS Tutorials"/><category term="SAS Unix"/><category term="SAS analyst"/><category term="SAS date 17750"/><category term="SAS forums/groups"/><category term="SAS free study tutorials"/><category term="SAS in Clinical Trials"/><category term="SAS in Life Sciences"/><category term="SAS sample Projects"/><category term="SAS sample programs"/><category term="SAS software"/><category term="SAS to XPT"/><category term="SAS video Tutorials"/><category term="SAS ® PROGRAM EFFICIENCY FOR BEGINNERS"/><category term="SAShelp.Vcolumn"/><category term="SAShelp.Vtable"/><category term="SAS® Programming Guidelines"/><category term="SCAN Function in SAS"/><category term="SDTM Compliance Checks"/><category term="SDTM IG 3.1.1"/><category term="SDTM Package"/><category term="SDTM Programming interview."/><category term="SDTM QC Checks"/><category term="SDTM V 1.2"/><category term="SDTM VALIDATION TOOLS"/><category term="SDTM Validation Checks"/><category term="SDTM Validation Checks for V3.1.1/V.3.1.2 and or V 3.1.3"/><category term="SDTM compliance"/><category term="SET"/><category term="SET statement"/><category term="SET statementm First. and Last. 
variables"/><category term="SHIFT"/><category term="SOC"/><category term="SPSS"/><category term="SQLOBS"/><category term="STRIP"/><category term="SUM function"/><category term="SUPPDS Domain revised assumption"/><category term="Safety datasets examples"/><category term="Sample Base SAS Certification Questions"/><category term="Sampling Datasets"/><category term="Save the log file"/><category term="Saving the Log and Output files"/><category term="Search a character expression for a string"/><category term="Self Teach SAS Tutorials"/><category term="Sending the Log and Output"/><category term="Separating Unique and Duplicate Observations Using"/><category term="Set and Merge Statements"/><category term="Shuffling"/><category term="Single Blind study"/><category term="Skin Response Domain"/><category term="Solitaire"/><category term="Sort multiple SAS datasets using single Proc sort and CALL Execute"/><category term="Sparse Option"/><category term="Special Missing Values"/><category term="Special missing characaters"/><category term="Statistical Analaysis Plan (SAP)"/><category term="Statistical Programming"/><category term="Strip characters"/><category term="Study DAY"/><category term="Study Day calculation"/><category term="Study design"/><category term="Subject Reference Start Date."/><category term="Subscript"/><category term="Superscript"/><category term="TRIM"/><category term="TS Domain"/><category term="TS.SSTDTC"/><category term="TSPARMCD=ACTSUB"/><category term="TSPARMCD=ADAPT"/><category term="TSPARMCD=SSTDTC. 
Study Start Date"/><category term="Techniques"/><category term="Temporary SAS Dataset"/><category term="The MS Excel table (worksheetname) has been opened for OUTPUT"/><category term="Therapetic areas"/><category term="This engine does not support the REPLACE option"/><category term="Three SAS Programs that use Arrays"/><category term="Time"/><category term="Tips and Techniques"/><category term="Tips for Producing Comprehensive Integrated Summaries"/><category term="Transport files"/><category term="Transpose vs Arrays"/><category term="Trasposing the data using Arrays"/><category term="Treatment Emergent AE"/><category term="Trial Summary Domain"/><category term="Trscoding Error UTF-8"/><category term="Tumor domains.--STAT."/><category term="UNIQUEOUT"/><category term="UTF-8"/><category term="UTF-8 encoding"/><category term="Unblinding"/><category term="Unlease the power of Proc Datasets"/><category term="Upcase SAS variables with PROC Datasets."/><category term="Upcase all variables"/><category term="Upcase funtion"/><category term="Update"/><category term="Uppercase and lowercase shortcut key"/><category term="VEXTFILES"/><category term="VEXTFL"/><category term="VISITS TV domain vs SDTM datasets"/><category term="VT"/><category term="VTITLES"/><category term="Validation Rules"/><category term="Validation checks on SDTM data"/><category term="Validation of programs"/><category term="Variable exists in dataset or not"/><category term="Varnum"/><category term="WEBSDM"/><category term="WHERE expressions"/><category term="WHODrug Coding"/><category term="What&#39;s New in SAS 9.2"/><category term="When should I put a variable in the CLASS statement? 
What does the CLASS statement do?"/><category term="Where statement"/><category term="Where vs IF statements"/><category term="Whole Path"/><category term="Why SAS"/><category term="Without Lag Function"/><category term="Working With Missing Values"/><category term="Write Log and Output to a seperate file"/><category term="XPORT engine"/><category term="Xpath"/><category term="YRDIF"/><category term="YYYY"/><category term="You can do so much with Proc Datasets"/><category term="Zero Observations"/><category term="Zw.d format"/><category term="_character_"/><category term="_numerric_"/><category term="act/act"/><category term="add leading zeros to numeric variables"/><category term="annotate"/><category term="anydate"/><category term="anydtdte30."/><category term="autoexec.sas"/><category term="axis1"/><category term="cSDRG study design section"/><category term="call routines"/><category term="char_cars (i)"/><category term="character conversions"/><category term="character type"/><category term="check if numeric/character variable exists in same prefix name variables"/><category term="cmiss vs nmiss in SAS"/><category term="cntlout"/><category term="color"/><category term="compress functions etc"/><category term="copy directories from one folder to another"/><category term="correctencoding=wlatin1"/><category term="criterion"/><category term="data cleaning"/><category term="datestyle=dmy"/><category term="datestyle=mdy"/><category term="datetime"/><category term="define xml"/><category term="detect the missing values. 
non missing"/><category term="difference"/><category term="dim function"/><category term="displays column names as column headings instead of column labels"/><category term="dot-Z"/><category term="duplicates"/><category term="explicit step boundaries"/><category term="fileexist"/><category term="flyover"/><category term="fmtsearch option"/><category term="formatting differences"/><category term="generate the month name from a numeric value"/><category term="in SDTM Programming: A Detailed Guide with Examples"/><category term="index"/><category term="indexc"/><category term="just use Proc datasets"/><category term="keep characters"/><category term="keydef"/><category term="keys"/><category term="leading zeros."/><category term="left functions"/><category term="length of any character values cannot be more than 200 characters"/><category term="line pointer controls"/><category term="list of variables in a DATA set"/><category term="macro to list all folders present in a specified directory"/><category term="mdy"/><category term="mdyampm"/><category term="mdyampm25.2"/><category term="memlabel"/><category term="memname"/><category term="mergeNoBy=nowarn"/><category term="minor"/><category term="mmddyy10. Z8."/><category term="monname3."/><category term="monnamew. 
format"/><category term="month and year"/><category term="more than 262 characters long"/><category term="noQuoteLenMax"/><category term="noobs"/><category term="number of missing and non missing records of variables"/><category term="numeric variables length more than 8"/><category term="openCDISC"/><category term="openCDISC validator"/><category term="option nofmterr"/><category term="or word:INDEX/INDEXC/INDEXW Functions"/><category term="output SAS graphics in to PS file"/><category term="page number"/><category term="parsing"/><category term="pattern"/><category term="practice Base SAS questions"/><category term="preclinical"/><category term="proc cimport"/><category term="proc cport"/><category term="protocol to cSDRG tense changes"/><category term="read datetime string directly into SAS datetime format"/><category term="read next record while working on the current record"/><category term="remove characters"/><category term="repeat function"/><category term="routine reports"/><category term="rtf output"/><category term="sas LOG width of comments is not in between 1 and 200 characters"/><category term="scan"/><category term="seed"/><category term="space or &quot;-&quot; between the date"/><category term="specific character"/><category term="string functions"/><category term="study design documentation"/><category term="tod8. 
format"/><category term="transcoding error"/><category term="truncover"/><category term="unbalanced quotation marks"/><category term="upcase functions"/><category term="upcase macro"/><category term="using formats and format libraries"/><category term="validation"/><category term="what to use"/><category term="writing SAS log file to output file"/><category term="xml review and findings"/><category term="yymmdd8."/><title type='text'>STUDYSAS BLOG</title><subtitle type='html'></subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://studysas.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2315822260943695633/posts/default?redirect=false'/><link rel='alternate' type='text/html' href='http://studysas.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><link rel='next' type='application/atom+xml' href='http://www.blogger.com/feeds/2315822260943695633/posts/default?start-index=26&amp;max-results=25&amp;redirect=false'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>277</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>25</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-2315822260943695633.post-1819660908879287709</id><published>2026-04-09T20:53:00.007-04:00</published><updated>2026-04-24T07:20:20.169-04:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="LB Unit Conversion in SDTM"/><title type='text'>7 Mistakes in LB Unit Conversion That Still Show Up in SDTM</title><content type='html'>&lt;!DOCTYPE html&gt;
&lt;html lang=&quot;en&quot;&gt;
&lt;head&gt;
&lt;meta charset=&quot;UTF-8&quot;&gt;
&lt;meta name=&quot;viewport&quot; content=&quot;width=device-width, initial-scale=1.0&quot;&gt;
&lt;title&gt;7 Mistakes in LB Unit Conversion That Still Show Up in SDTM | StudySAS&lt;/title&gt;
&lt;style&gt;
body {
  font-family: Georgia, &#39;Times New Roman&#39;, serif;
  font-size: 16px;
  line-height: 1.8;
  color: #2c2c2c;
  background: #ffffff;
  margin: 0;
  padding: 40px 20px 80px;
}
.wrap {
  max-width: 820px;
  margin: 0 auto;
}
.meta {
  font-family: Arial, sans-serif;
  font-size: 12.5px;
  color: #888;
  margin-bottom: 6px;
  letter-spacing: 0.03em;
}
h1 {
  font-size: 2rem;
  color: #1a1a2e;
  line-height: 1.3;
  margin: 0 0 12px 0;
}
h2 {
  font-size: 1.35rem;
  color: #1a1a2e;
  font-weight: bold;
  margin-top: 54px;
  margin-bottom: 14px;
  border-bottom: 2px solid #e0e4ef;
  padding-bottom: 6px;
}
h3 {
  font-size: 1.08rem;
  color: #1a1a2e;
  font-weight: bold;
  margin-top: 32px;
  margin-bottom: 10px;
}
p {
  margin: 0 0 18px 0;
}
pre {
  background: #f5f5f5;
  border-left: 4px solid #0066cc;
  font-family: &#39;Courier New&#39;, Courier, monospace;
  font-size: 13.5px;
  line-height: 1.65;
  padding: 16px 20px;
  overflow-x: auto;
  white-space: pre-wrap;
  word-break: break-word;
  margin: 22px 0;
  border-radius: 0 4px 4px 0;
}
code {
  font-family: &#39;Courier New&#39;, Courier, monospace;
  font-size: 13.5px;
  background: #f0f0f0;
  color: #b5432a;
  padding: 2px 5px;
  border-radius: 3px;
}
pre code {
  background: none;
  color: inherit;
  padding: 0;
}
.note {
  background: #fff8e1;
  border-left: 4px solid #f9a825;
  padding: 13px 16px;
  margin: 22px 0;
  border-radius: 0 4px 4px 0;
  font-size: 15px;
}
.warn {
  background: #fff3f3;
  border-left: 4px solid #cc2222;
  padding: 13px 16px;
  margin: 22px 0;
  border-radius: 0 4px 4px 0;
  font-size: 15px;
}
.takeaway {
  background: #eef3ff;
  border-left: 4px solid #4a6cf7;
  padding: 16px 20px;
  margin: 26px 0;
  border-radius: 0 4px 4px 0;
  font-size: 16px;
  line-height: 2;
}
table {
  width: 100%;
  border-collapse: collapse;
  margin: 26px 0;
  font-size: 14.5px;
}
thead tr {
  background: #1a1a2e;
  color: #fff;
}
thead th {
  padding: 10px 13px;
  text-align: left;
  font-family: Arial, sans-serif;
  font-size: 13px;
  letter-spacing: 0.03em;
}
td {
  padding: 10px 13px;
  border-bottom: 1px solid #e0e0e0;
  vertical-align: top;
}
td code {
  font-size: 12.5px;
}
tbody tr:nth-child(even) {
  background: #f8f9fc;
}
.hl td {
  background: #fff8e1 !important;
}
hr {
  border: none;
  border-top: 1px solid #dde2ee;
  margin: 48px 0;
}
.mistake-num {
  display: inline-block;
  background: #1a1a2e;
  color: #fff;
  font-family: Arial, sans-serif;
  font-size: 11px;
  font-weight: bold;
  padding: 2px 8px;
  border-radius: 3px;
  letter-spacing: 0.06em;
  margin-right: 6px;
  vertical-align: middle;
}
.footnote {
  font-size: 13.5px;
  color: #555;
  margin-top: 44px;
  border-top: 1px solid #dde2ee;
  padding-top: 18px;
  line-height: 1.7;
}
.footnote p {
  margin-bottom: 8px;
}
.footnote a {
  color: #0066cc;
  text-decoration: none;
}
.footnote a:hover {
  text-decoration: underline;
}
.tags {
  font-family: Arial, sans-serif;
  font-size: 12.5px;
  margin-top: 36px;
  padding-top: 16px;
  border-top: 1px solid #e0e4ef;
}
.tags span {
  display: inline-block;
  background: #eef1f8;
  color: #1a1a2e;
  padding: 3px 10px;
  border-radius: 12px;
  margin: 3px 3px 3px 0;
  font-size: 12px;
}
&lt;/style&gt;
&lt;/head&gt;
&lt;body&gt;
&lt;div class=&quot;wrap&quot;&gt;

&lt;p&gt;LB unit conversion is not hard to understand. The variable definitions are clear. The model is well-documented. And yet unit-related problems remain among the most common issues in SDTM submissions — not because teams do not know the rules, but because the failures are quiet. Nothing crashes. The dataset looks valid. Pinnacle 21 returns a clean report. Then a reviewer runs a simple cross-tabulation and finds creatinine flagged HIGH at a value that falls well within the normal range — because the reference range was never converted to match the standard unit.&lt;/p&gt;

&lt;p&gt;This post is a close look at seven implementation mistakes, each with enough context and worked examples to make the failure mode visible before it reaches a submission.&lt;/p&gt;

&lt;div class=&quot;takeaway&quot;&gt;
&lt;strong&gt;LBORRES is what was collected. Verbatim. Untouched.&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;LBSTRES* is the standardized representation. One unit per test.&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;LBSTNR* must follow the same unit system as LBSTRES*.&lt;/strong&gt;
&lt;/div&gt;

&lt;p&gt;That three-line rule explains most of what goes wrong. The seven mistakes below are elaborations of what happens when any part of that contract is broken.&lt;/p&gt;

&lt;hr&gt;

&lt;h2&gt;Background: what the standard requires&lt;/h2&gt;

&lt;p&gt;SDTMIG defines the LB domain variable roles clearly. &lt;code&gt;LBORRES&lt;/code&gt; holds the result exactly as collected from the source. &lt;code&gt;LBORRESU&lt;/code&gt; holds the unit as reported by the lab. &lt;code&gt;LBSTRESC&lt;/code&gt; holds the standardized result in character form. &lt;code&gt;LBSTRESN&lt;/code&gt; holds the numeric portion of that standardized result. &lt;code&gt;LBSTRESU&lt;/code&gt; holds the standard unit — and this unit must be consistent for a given &lt;code&gt;LBTESTCD&lt;/code&gt; across the entire dataset. &lt;code&gt;LBSTNRLO&lt;/code&gt; and &lt;code&gt;LBSTNRHI&lt;/code&gt; hold the standardized reference range limits, which must be in the same unit as &lt;code&gt;LBSTRESU&lt;/code&gt;. &lt;code&gt;LBNRIND&lt;/code&gt; holds the normal/abnormal flag, and SDTMIG explicitly requires that sponsors document whether this flag is derived from original ranges or standard ranges.&lt;/p&gt;

&lt;p&gt;FDA&#39;s Study Data Technical Conformance Guide v6.1 (December 2025) adds a structural requirement on top of this: for clinical submissions, the &lt;code&gt;LB&lt;/code&gt; domain should carry SI-standardized results, and a parallel custom domain &lt;code&gt;LC&lt;/code&gt; should carry conventional-unit results. These are not alternative approaches — they are parallel datasets required in the same submission package. PMDA&#39;s guidance reinforces SI use in the standardized result fields and additionally requires that the conversion equation be documented in reviewer-facing materials when original and converted values coexist.&lt;/p&gt;

&lt;p&gt;That is the framework. Now, where it breaks.&lt;/p&gt;

&lt;hr&gt;

&lt;h2&gt;&lt;span class=&quot;mistake-num&quot;&gt;01&lt;/span&gt; Changing LBORRES&lt;/h2&gt;

&lt;p&gt;This is the foundational error, and it is more common than it should be. &lt;code&gt;LBORRES&lt;/code&gt; is the traceability anchor for the entire LB record. Its purpose is to preserve what the site or lab actually reported, in the exact form it was reported. The moment you alter &lt;code&gt;LBORRES&lt;/code&gt;, you have introduced a gap between the submitted dataset and the source document — and that gap is very difficult to justify under audit.&lt;/p&gt;

&lt;p&gt;The alterations I see most often are trimming trailing zeros (&lt;code&gt;1.20&lt;/code&gt; becomes &lt;code&gt;1.2&lt;/code&gt;), standardizing case (&lt;code&gt;POSITIVE&lt;/code&gt; becomes &lt;code&gt;Positive&lt;/code&gt;), removing comparison operators (&lt;code&gt;&amp;lt;0.10&lt;/code&gt; becomes &lt;code&gt;0.10&lt;/code&gt;), and normalizing text qualifiers (&lt;code&gt;Below LOQ&lt;/code&gt; becomes &lt;code&gt;BLQ&lt;/code&gt;). Each of these looks like a cleanup. None of them is permitted.&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Scenario&lt;/th&gt;&lt;th&gt;Source value&lt;/th&gt;&lt;th&gt;LBORRES — wrong&lt;/th&gt;&lt;th&gt;LBORRES — correct&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
  &lt;td&gt;Trailing zero trimmed&lt;/td&gt;
  &lt;td&gt;1.20 mg/dL&lt;/td&gt;
  &lt;td&gt;1.2&lt;/td&gt;
  &lt;td&gt;1.20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;Operator stripped&lt;/td&gt;
  &lt;td&gt;&amp;lt;0.10 ng/mL&lt;/td&gt;
  &lt;td&gt;0.10&lt;/td&gt;
  &lt;td&gt;&amp;lt;0.10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;Text normalized&lt;/td&gt;
  &lt;td&gt;Below LOQ&lt;/td&gt;
  &lt;td&gt;BLQ&lt;/td&gt;
  &lt;td&gt;Below LOQ&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;Case changed&lt;/td&gt;
  &lt;td&gt;NEGATIVE&lt;/td&gt;
  &lt;td&gt;Negative&lt;/td&gt;
  &lt;td&gt;NEGATIVE&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;The right place for cleanup is &lt;code&gt;LBSTRESC&lt;/code&gt;. If the sponsor has a standard for how qualitative results should be represented, that standard is applied in the standardized layer — not by modifying the original. &lt;code&gt;LBORRES&lt;/code&gt; is read-only from the moment data is collected.&lt;/p&gt;

&lt;div class=&quot;warn&quot;&gt;
&lt;strong&gt;Regulatory implication:&lt;/strong&gt; FDA expects that submitted data can be traced back to the source CRF or lab report. If LBORRES has been altered from the source value, the auditor&#39;s reconciliation path is broken. This is not a conformance warning — it is a data integrity issue.
&lt;/div&gt;

&lt;hr&gt;

&lt;h2&gt;&lt;span class=&quot;mistake-num&quot;&gt;02&lt;/span&gt; Using LBSTRESC as a cosmetic copy of LBORRES&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;LBSTRESC&lt;/code&gt; exists to do real work: converting units, standardizing result formats, or mapping to a controlled representation. When a team populates &lt;code&gt;LBSTRESC&lt;/code&gt; merely by reformatting &lt;code&gt;LBORRES&lt;/code&gt; — trimming a zero, changing &lt;code&gt;1.20&lt;/code&gt; to &lt;code&gt;1.2&lt;/code&gt;, or left-justifying a string — without any underlying standardization rule, the column exists but contributes nothing.&lt;/p&gt;

&lt;p&gt;The problem goes beyond cosmetics. A dataset where &lt;code&gt;LBSTRESC&lt;/code&gt; is a visual cleanup of &lt;code&gt;LBORRES&lt;/code&gt; looks standardized to a tool like Pinnacle 21 — all the required variables are populated. But a reviewer who pulls &lt;code&gt;LBSTRESN&lt;/code&gt; expecting a consistent, analysis-ready numeric result will find that the values are in whatever unit each site happened to report, because no actual standardization happened.&lt;/p&gt;

&lt;p&gt;This mistake is hardest to detect in single-site studies where the lab reports all results in one unit. Everything looks fine because conversion was never needed. The hidden problem surfaces when the same mapping spec is reused in a multi-site follow-on study with a different lab, and the team assumes the spec is handling standardization when it is only handling formatting.&lt;/p&gt;

&lt;div class=&quot;note&quot;&gt;
&lt;strong&gt;The test:&lt;/strong&gt; for every LB record, ask whether &lt;code&gt;LBSTRESC&lt;/code&gt; would have a different value if &lt;code&gt;LBORRES&lt;/code&gt; were in a different unit. If the answer is no — if the logic does not touch the unit at all — then &lt;code&gt;LBSTRESC&lt;/code&gt; is not doing standardization work. It is doing formatting work. That belongs in output displays, not in the SDTM submission dataset.
&lt;/div&gt;

&lt;hr&gt;

&lt;h2&gt;&lt;span class=&quot;mistake-num&quot;&gt;03&lt;/span&gt; Converting the result but not the reference range&lt;/h2&gt;

&lt;p&gt;This is the most dangerous mistake on the list because it is completely silent in conformance validation and produces flags that look correct but are not. The result moves into the standard unit. The reference range stays in the original unit. &lt;code&gt;LBNRIND&lt;/code&gt; is derived by comparing the converted result against the unconverted range. The comparison is numerically meaningless, but the output looks like a legitimate flag.&lt;/p&gt;

&lt;p&gt;Here is a concrete worked example. A study collects glucose from US sites in mg/dL and from European sites in mmol/L. The sponsor selects mmol/L as the standard unit. The JCTLM-recommended conversion factor for mg/dL to mmol/L is division by 18.018.&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Variable&lt;/th&gt;&lt;th&gt;US site record (collected in mg/dL)&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;LBORRES&lt;/code&gt;&lt;/td&gt;&lt;td&gt;95&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;LBORRESU&lt;/code&gt;&lt;/td&gt;&lt;td&gt;mg/dL&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;LBSTRESN&lt;/code&gt;&lt;/td&gt;&lt;td&gt;5.27 &amp;nbsp;(95 ÷ 18.018)&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;LBSTRESU&lt;/code&gt;&lt;/td&gt;&lt;td&gt;mmol/L&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;LBNRLO&lt;/code&gt; as reported by lab&lt;/td&gt;&lt;td&gt;70 mg/dL&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;LBNRHI&lt;/code&gt; as reported by lab&lt;/td&gt;&lt;td&gt;100 mg/dL&lt;/td&gt;&lt;/tr&gt;
&lt;tr class=&quot;hl&quot;&gt;&lt;td&gt;&lt;code&gt;LBSTNRLO&lt;/code&gt; — wrong&lt;/td&gt;&lt;td&gt;70 &amp;nbsp;(not converted)&lt;/td&gt;&lt;/tr&gt;
&lt;tr class=&quot;hl&quot;&gt;&lt;td&gt;&lt;code&gt;LBSTNRHI&lt;/code&gt; — wrong&lt;/td&gt;&lt;td&gt;100 &amp;nbsp;(not converted)&lt;/td&gt;&lt;/tr&gt;
&lt;tr class=&quot;hl&quot;&gt;&lt;td&gt;&lt;code&gt;LBNRIND&lt;/code&gt; derived&lt;/td&gt;&lt;td&gt;LOW &amp;nbsp;(5.27 &amp;lt; 70 in the comparison)&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;A glucose of 5.27 mmol/L is squarely within the normal range (approximately 3.9–6.1 mmol/L). But the range stored in &lt;code&gt;LBSTNRLO&lt;/code&gt; is 70 and the stored unit is mmol/L — producing a comparison of 5.27 against 70, which flags the record as LOW. That flag travels into every safety table and shift table built from &lt;code&gt;LBNRIND&lt;/code&gt;. Nothing in P21 catches it.&lt;/p&gt;

&lt;p&gt;The correct derived ranges for this record are 70 ÷ 18.018 = 3.89 mmol/L and 100 ÷ 18.018 = 5.55 mmol/L. The conversion factor used for &lt;code&gt;LBSTNRLO&lt;/code&gt; and &lt;code&gt;LBSTNRHI&lt;/code&gt; must be identical to the one used for &lt;code&gt;LBSTRESN&lt;/code&gt;. Same metadata table, same factor, same rounding rule.&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Variable&lt;/th&gt;&lt;th&gt;Wrong&lt;/th&gt;&lt;th&gt;Correct&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;LBSTRESN&lt;/code&gt;&lt;/td&gt;&lt;td&gt;5.27&lt;/td&gt;&lt;td&gt;5.27&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;LBSTRESU&lt;/code&gt;&lt;/td&gt;&lt;td&gt;mmol/L&lt;/td&gt;&lt;td&gt;mmol/L&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;LBSTNRLO&lt;/code&gt;&lt;/td&gt;&lt;td&gt;70 &amp;nbsp;(mg/dL scale, unconverted)&lt;/td&gt;&lt;td&gt;3.89 &amp;nbsp;(mmol/L, converted)&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;LBSTNRHI&lt;/code&gt;&lt;/td&gt;&lt;td&gt;100 &amp;nbsp;(mg/dL scale, unconverted)&lt;/td&gt;&lt;td&gt;5.55 &amp;nbsp;(mmol/L, converted)&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;LBNRIND&lt;/code&gt;&lt;/td&gt;&lt;td&gt;LOW &amp;nbsp;(wrong)&lt;/td&gt;&lt;td&gt;NORMAL &amp;nbsp;(correct)&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; The value &lt;code&gt;5.27&lt;/code&gt; is correct in both cases. The failure is not the numeric result — it is that the reference ranges were not converted to match the standard unit. This breaks &lt;code&gt;LBNRIND&lt;/code&gt; even when the result itself is correct.&lt;/p&gt;

&lt;p&gt;SDTMIG also requires that sponsors document in Define.xml comments whether &lt;code&gt;LBNRIND&lt;/code&gt; is based on original ranges or standardized ranges. This is explicit guidance, not optional. If you derive &lt;code&gt;LBNRIND&lt;/code&gt; from standardized ranges — which is the cleaner and more consistent approach — state that. If you derive from original lab ranges, state that too. I have reviewed submissions where 20–30% of &lt;code&gt;LBNRIND&lt;/code&gt; flags were wrong for exactly the reason shown above. Nothing in P21 caught it. A reviewer running basic summary statistics was the one who found it.&lt;/p&gt;

&lt;div class=&quot;warn&quot;&gt;
&lt;strong&gt;Check to run:&lt;/strong&gt; for every record where LBORRESU ≠ LBSTRESU, verify that LBSTNRLO and LBSTNRHI are numerically consistent with the standard unit — not the original unit. If your LBSTRESN values are in the range 3–7 but your LBSTNRHI values are in the range 70–110, you have unconverted reference ranges.
&lt;/div&gt;

&lt;hr&gt;

&lt;h2&gt;&lt;span class=&quot;mistake-num&quot;&gt;04&lt;/span&gt; Forcing qualitative results into LBSTRESN&lt;/h2&gt;

&lt;p&gt;SDTMIG is unambiguous: &lt;code&gt;LBSTRESN&lt;/code&gt; holds the numeric portion of the standardized result. When the standardized result is qualitative — POSITIVE, NEGATIVE, NORMAL, ABNORMAL, TRACE, 1+, 2+, 3+ — there is no numeric portion. &lt;code&gt;LBSTRESN&lt;/code&gt; must be null. The result goes into &lt;code&gt;LBSTRESC&lt;/code&gt; in character form and stays there.&lt;/p&gt;

&lt;p&gt;The pressure to encode ordinal results numerically in SDTM usually comes from analysts who want to sort or compute on the scale downstream. That is a legitimate need, but it belongs in ADaM — typically as a derived numeric variable in ADLB — not in the SDTM submission dataset. Populating &lt;code&gt;LBSTRESN&lt;/code&gt; with 0 for NEGATIVE and 1 for POSITIVE, or with 0.5 for TRACE and 1 for 1+ and 2 for 2+, introduces a numeric encoding that SDTMIG does not support and that Pinnacle 21 will flag as SD0086 or similar depending on the ruleset.&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;LBORRES&lt;/th&gt;&lt;th&gt;LBSTRESC&lt;/th&gt;&lt;th&gt;LBSTRESN — wrong&lt;/th&gt;&lt;th&gt;LBSTRESN — correct&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;NEGATIVE&lt;/td&gt;&lt;td&gt;NEGATIVE&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;. (null)&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;TRACE&lt;/td&gt;&lt;td&gt;TRACE&lt;/td&gt;&lt;td&gt;0.5&lt;/td&gt;&lt;td&gt;. (null)&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;1+&lt;/td&gt;&lt;td&gt;1+&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;. (null)&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;2+&lt;/td&gt;&lt;td&gt;2+&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;td&gt;. (null)&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;3+&lt;/td&gt;&lt;td&gt;3+&lt;/td&gt;&lt;td&gt;3&lt;/td&gt;&lt;td&gt;. (null)&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;The exception is a BLQ result that includes a numeric boundary. A result reported as &lt;code&gt;&amp;lt;0.10 ng/mL&lt;/code&gt; is not purely qualitative — it has a numeric component. In this case, the boundary value belongs in &lt;code&gt;LBSTRESN&lt;/code&gt; and the operator stays in &lt;code&gt;LBSTRESC&lt;/code&gt;. That is the inequality pattern, covered in Mistake 5. The key distinction is whether the result has a numeric boundary at all. If LBORRES is the string &lt;code&gt;BLQ&lt;/code&gt; with no numeric component attached, &lt;code&gt;LBSTRESN&lt;/code&gt; is null. If LBORRES is &lt;code&gt;&amp;lt;0.10&lt;/code&gt;, the boundary 0.10 goes into &lt;code&gt;LBSTRESN&lt;/code&gt;.&lt;/p&gt;

&lt;pre&gt;
/* Qualitative — LBSTRESN must be null */
LBORRES   = NEGATIVE
LBSTRESC  = NEGATIVE
LBSTRESN  = .

/* Ordinal — LBSTRESN must be null */
LBORRES   = 2+
LBSTRESC  = 2+
LBSTRESN  = .

/* BLQ with numeric boundary — inequality pattern applies */
LBORRES   = &amp;lt;0.10
LBSTRESC  = &amp;lt;0.10
LBSTRESN  = 0.10
LBSTRESU  = ng/mL
&lt;/pre&gt;

&lt;hr&gt;

&lt;h2&gt;&lt;span class=&quot;mistake-num&quot;&gt;05&lt;/span&gt; Losing the operator on inequality results&lt;/h2&gt;

&lt;p&gt;When a lab returns a result as &lt;code&gt;&amp;lt;0.10&lt;/code&gt;, that full string is the result. The operator is not decoration — it is part of the scientific meaning. A value of &lt;code&gt;&amp;lt;0.10&lt;/code&gt; says the measurement was below the limit of detection and the true value is somewhere below the numeric boundary. Stripping the operator and storing &lt;code&gt;0.10&lt;/code&gt; in &lt;code&gt;LBSTRESC&lt;/code&gt; converts an undetected result into a detected one. That changes the data.&lt;/p&gt;

&lt;p&gt;The SDTMIG model for inequality results is: the operator-qualified string goes into &lt;code&gt;LBSTRESC&lt;/code&gt;, and the numeric boundary goes into &lt;code&gt;LBSTRESN&lt;/code&gt;. When unit conversion is also needed, the conversion applies to the numeric boundary — and the converted operator string goes into &lt;code&gt;LBSTRESC&lt;/code&gt;. Both variables must reflect the post-conversion boundary, not the original.&lt;/p&gt;

&lt;p&gt;Here is a full worked example. Glucose is reported as &lt;code&gt;&amp;lt;2.0 mmol/L&lt;/code&gt; — below the limit of detection for the assay. The sponsor standard unit is mg/dL, using a conversion factor of 18.018.&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Variable&lt;/th&gt;&lt;th&gt;Wrong&lt;/th&gt;&lt;th&gt;Correct&lt;/th&gt;&lt;th&gt;Note&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;code&gt;LBORRES&lt;/code&gt;&lt;/td&gt;
  &lt;td&gt;&amp;lt;2.0&lt;/td&gt;
  &lt;td&gt;&amp;lt;2.0&lt;/td&gt;
  &lt;td&gt;Verbatim — same in both cases&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;code&gt;LBORRESU&lt;/code&gt;&lt;/td&gt;
  &lt;td&gt;mmol/L&lt;/td&gt;
  &lt;td&gt;mmol/L&lt;/td&gt;
  &lt;td&gt;Original unit — same in both cases&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&quot;hl&quot;&gt;
  &lt;td&gt;&lt;code&gt;LBSTRESC&lt;/code&gt;&lt;/td&gt;
  &lt;td&gt;36.04&lt;/td&gt;
  &lt;td&gt;&amp;lt;36.04&lt;/td&gt;
  &lt;td&gt;Operator dropped in wrong version&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&quot;hl&quot;&gt;
  &lt;td&gt;&lt;code&gt;LBSTRESN&lt;/code&gt;&lt;/td&gt;
  &lt;td&gt;36.04&lt;/td&gt;
  &lt;td&gt;36.04&lt;/td&gt;
  &lt;td&gt;Numeric boundary — same in both&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;code&gt;LBSTRESU&lt;/code&gt;&lt;/td&gt;
  &lt;td&gt;mg/dL&lt;/td&gt;
  &lt;td&gt;mg/dL&lt;/td&gt;
  &lt;td&gt;Standard unit — same in both cases&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;The wrong version stores the converted numeric value in &lt;code&gt;LBSTRESC&lt;/code&gt; without the operator. To a reviewer reading that record, glucose is 36.04 mg/dL — a detected value near the low end of the normal range, not a below-detection result. That is a materially different clinical statement.&lt;/p&gt;

&lt;p&gt;A second failure mode is when the conversion is applied to the operator string correctly, but LBSTRESC and LBSTRESN are derived from two independent computations rather than one, causing a floating-point mismatch. The safe pattern is to compute the rounded LBSTRESN first, then derive LBSTRESC from that rounded value:&lt;/p&gt;

&lt;pre&gt;
/* Compute once, derive twice — safe pattern */
lbstresn = round(2.0 * 18.018, 0.01);        /* = 36.04 */
lbstresc = cats(&#39;&amp;lt;&#39;, strip(put(lbstresn, best12.)));  /* = &quot;&amp;lt;36.04&quot; */

/* Independent computation — unsafe pattern */
/* lbstresc = cats(&#39;&amp;lt;&#39;, put(2.0 * 18.018, best12.)); */
/* lbstresn = round(2.0 * 18.018, 0.01);              */
/* These can diverge due to floating-point representation */
&lt;/pre&gt;

&lt;div class=&quot;note&quot;&gt;
&lt;strong&gt;Edge case:&lt;/strong&gt; some lab systems encode above-range results as &lt;code&gt;&amp;gt;5000&lt;/code&gt; and some as the plain value &lt;code&gt;5000&lt;/code&gt; with a separate high flag. If LBORRES contains &lt;code&gt;5000&lt;/code&gt; without an operator, LBSTRESC and LBSTRESN both take the value 5000. Do not infer an operator from an associated flag variable — only use an operator in LBSTRESC when it is present in the source result.
&lt;/div&gt;

&lt;hr&gt;

&lt;h2&gt;&lt;span class=&quot;mistake-num&quot;&gt;06&lt;/span&gt; Allowing more than one LBSTRESU per LBTESTCD&lt;/h2&gt;

&lt;p&gt;The invariant is simple: for a given &lt;code&gt;LBTESTCD&lt;/code&gt;, there is one standard unit. Every record for that test code — regardless of which site collected it, which lab analyzed it, or when it was collected — must have the same value in &lt;code&gt;LBSTRESU&lt;/code&gt;. When that is violated, the standardized layer is not standardized. Analysis results computed across subjects become numerically mixed.&lt;/p&gt;

&lt;p&gt;This mistake happens in three specific situations. The first is unit pass-through: the programmer assigns &lt;code&gt;LBSTRESU&lt;/code&gt; directly from &lt;code&gt;LBORRESU&lt;/code&gt; without going through a conversion metadata table, assuming all sites report in the same unit. When they do not, the assumption introduces the inconsistency silently. The second is a mid-study lab change: the central lab upgrades its reporting system and begins sending TSH in mIU/L instead of µIU/mL. These units are numerically equivalent — the conversion factor is 1 — so no result values change, but the unit string in &lt;code&gt;LBSTRESU&lt;/code&gt; now varies within the same test code. P21 does not flag this. The third is a new local lab joining mid-study with a unit that was not anticipated in the original mapping spec, and the programmer adds a conditional branch to handle it rather than updating the metadata table.&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;USUBJID&lt;/th&gt;&lt;th&gt;VISIT&lt;/th&gt;&lt;th&gt;LBTESTCD&lt;/th&gt;&lt;th&gt;LBORRES&lt;/th&gt;&lt;th&gt;LBORRESU&lt;/th&gt;&lt;th&gt;LBSTRESN&lt;/th&gt;&lt;th&gt;LBSTRESU&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;US-001&lt;/td&gt;&lt;td&gt;Week 4&lt;/td&gt;&lt;td&gt;TSH&lt;/td&gt;&lt;td&gt;2.1&lt;/td&gt;&lt;td&gt;uIU/mL&lt;/td&gt;&lt;td&gt;2.1&lt;/td&gt;&lt;td&gt;uIU/mL&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;EU-007&lt;/td&gt;&lt;td&gt;Week 4&lt;/td&gt;&lt;td&gt;TSH&lt;/td&gt;&lt;td&gt;2.1&lt;/td&gt;&lt;td&gt;uIU/mL&lt;/td&gt;&lt;td&gt;2.1&lt;/td&gt;&lt;td&gt;uIU/mL&lt;/td&gt;&lt;/tr&gt;
&lt;tr class=&quot;hl&quot;&gt;&lt;td&gt;US-001&lt;/td&gt;&lt;td&gt;Week 24&lt;/td&gt;&lt;td&gt;TSH&lt;/td&gt;&lt;td&gt;2.3&lt;/td&gt;&lt;td&gt;mIU/L&lt;/td&gt;&lt;td&gt;2.3&lt;/td&gt;&lt;td&gt;mIU/L&lt;/td&gt;&lt;/tr&gt;
&lt;tr class=&quot;hl&quot;&gt;&lt;td&gt;EU-007&lt;/td&gt;&lt;td&gt;Week 24&lt;/td&gt;&lt;td&gt;TSH&lt;/td&gt;&lt;td&gt;2.4&lt;/td&gt;&lt;td&gt;mIU/L&lt;/td&gt;&lt;td&gt;2.4&lt;/td&gt;&lt;td&gt;mIU/L&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;The highlighted rows show Week 24 records where the lab changed its reporting unit string. The numeric values are correct — 1 mIU/L equals 1 µIU/mL — but &lt;code&gt;LBSTRESU&lt;/code&gt; now contains two different strings for the same &lt;code&gt;LBTESTCD&lt;/code&gt;. Pinnacle 21 validates &lt;code&gt;LBSTRESU&lt;/code&gt; against controlled terminology. It does not enforce a single unit per &lt;code&gt;LBTESTCD&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The fix is architectural. &lt;code&gt;LBSTRESU&lt;/code&gt; is assigned from a conversion metadata table keyed on &lt;code&gt;LBTESTCD&lt;/code&gt;, not mapped directly from &lt;code&gt;LBORRESU&lt;/code&gt;. The table says TSH → uIU/mL. It does not matter what string the lab sends in &lt;code&gt;LBORRESU&lt;/code&gt;. The standard unit is always what the metadata says. When a unit change happens at the lab, the metadata table is updated through change control and the unit string is harmonized.&lt;/p&gt;

&lt;pre&gt;
/* QC check: flag any LBTESTCD with multiple LBSTRESU values */
proc freq data=lb noprint;
  tables lbtestcd * lbstresu / out=unit_check;
run;

data unit_fail;
  set unit_check;
  by lbtestcd;
  if not (first.lbtestcd and last.lbtestcd);
run;

/* Any rows in unit_fail = submission-blocking issue */
proc print data=unit_fail; run;
&lt;/pre&gt;

&lt;hr&gt;

&lt;h2&gt;&lt;span class=&quot;mistake-num&quot;&gt;07&lt;/span&gt; Treating alternate units as a late-stage decision&lt;/h2&gt;

&lt;p&gt;This is the structural mistake. The others are implementation errors. This one is a planning failure that cannot be corrected cleanly at database lock.&lt;/p&gt;

&lt;p&gt;For FDA clinical submissions, the current Study Data Technical Conformance Guide v6.1 (December 2025) is explicit: submit two lab domains. &lt;code&gt;LB&lt;/code&gt; holds SI-standardized results in &lt;code&gt;LBSTRESU&lt;/code&gt;, &lt;code&gt;LBSTRESC&lt;/code&gt;, and &lt;code&gt;LBSTRESN&lt;/code&gt;. &lt;code&gt;LC&lt;/code&gt; is a custom domain structured identically to &lt;code&gt;LB&lt;/code&gt;, carrying conventional-unit results in the corresponding &lt;code&gt;–STRESU&lt;/code&gt;, &lt;code&gt;–STRESC&lt;/code&gt;, and &lt;code&gt;–STRESN&lt;/code&gt; fields. The guidance also states that the ideal source for both SI and conventional unit values is the lab vendor itself — not post-hoc conversion in SAS.&lt;/p&gt;

&lt;p&gt;That last point is operationally significant. If the vendor does not provide both unit systems, you are computing the alternate unit yourself. That conversion has to be governed, documented, and validated. A post-hoc conversion that is not in the original study spec, not in Define.xml, and not reviewed by the QC programmer is not defensible under audit. The time to contract for dual-unit vendor output is during study design, not at lock.&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Approach&lt;/th&gt;&lt;th&gt;Status&lt;/th&gt;&lt;th&gt;LB content&lt;/th&gt;&lt;th&gt;Alternate unit&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
  &lt;td&gt;SUPPLB workaround&lt;/td&gt;
  &lt;td&gt;Outdated for FDA clinical&lt;/td&gt;
  &lt;td&gt;Sponsor-chosen unit&lt;/td&gt;
  &lt;td&gt;Buried in SUPPQUAL — poor reviewer access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;Single domain, one unit&lt;/td&gt;
  &lt;td&gt;Outdated for FDA clinical&lt;/td&gt;
  &lt;td&gt;SI or conventional — one only&lt;/td&gt;
  &lt;td&gt;Not submitted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&quot;hl&quot;&gt;
  &lt;td&gt;LB + LC (SDTCG v6.1)&lt;/td&gt;
  &lt;td&gt;Current requirement&lt;/td&gt;
  &lt;td&gt;SI units in LBSTRESU&lt;/td&gt;
  &lt;td&gt;Conventional units in parallel LC domain&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;LC is not a SUPPQUAL extension. It is a full parallel dataset with its own metadata in Define.xml, its own LCSEQ sequencing variable, and its own entry in the Study Data Reviewer&#39;s Guide. A team discovering at lock that FDA expects LC — and that the lab vendor was not contracted to provide conventional units — is looking at a programmatic conversion workstream, a Define.xml rewrite for an additional domain, and an SDRG update, all under submission timelines. That is avoidable with an early planning conversation.&lt;/p&gt;

&lt;p&gt;PMDA&#39;s requirements differ in one important way. PMDA expects SI-standardized results in the submission and additionally requires that when original and converted values coexist in the dataset, the conversion equation must be documented in reviewer-facing materials — explicitly, not by reference. That means the SDRG or Define.xml Comments for &lt;code&gt;LBSTRESN&lt;/code&gt; and &lt;code&gt;LBSTRESC&lt;/code&gt; should state the factor used. &quot;Converted from LBORRESU to LBSTRESU&quot; is not enough. The equation or factor must be present.&lt;/p&gt;

&lt;div class=&quot;note&quot;&gt;
&lt;strong&gt;Global submission design rule:&lt;/strong&gt; if a study will be submitted to both FDA and PMDA, design around SI as the universal standard for LB from the start. Produce LC for conventional units to satisfy FDA&#39;s parallel-domain requirement. Document the conversion equation in Define.xml and the SDRG to satisfy PMDA. These two requirements are compatible — they just both need to be planned for.
&lt;/div&gt;

&lt;hr&gt;

&lt;h2&gt;What good implementation looks like&lt;/h2&gt;

&lt;p&gt;A defensible LB conversion setup rests on a single foundational design decision: conversion logic must live in a governed metadata table, not in SAS code. If your conversion logic lives in SAS code instead of a governed metadata table, you do not have a controlled process. You have an implementation. That distinction matters when a new local lab joins mid-enrollment, when a factor needs to be corrected, when a second programmer tries to independently reproduce the derivation, or when a regulatory reviewer asks how a specific result was computed.&lt;/p&gt;

&lt;p&gt;The metadata table has one row per &lt;code&gt;LBTESTCD&lt;/code&gt; and source unit. It contains the standard unit, the conversion factor, the source of that factor (JCTLM, NIST SP 811, or lab vendor specification), and the rounding precision. Every derivation — result, reference range, and flag — flows from the same table. When the table changes, every downstream derivation changes consistently. That is a controlled process.&lt;/p&gt;

&lt;pre&gt;
/* Example conversion metadata — one row per LBTESTCD × source unit */
data lb_conv_meta;
  length lbtestcd $8  from_unit std_unit $40  source $60;
  input lbtestcd $ from_unit $ std_unit $ factor round_to source $;
  datalines;
CREAT   mg/dL    umol/L   88.4200  0.1   JCTLM
CREAT   umol/L   umol/L    1.0000  0.1   Pass-through
GLUC    mg/dL    mmol/L    0.0555  0.01  JCTLM
GLUC    mmol/L   mmol/L    1.0000  0.01  Pass-through
HGB     g/dL     g/L      10.0000  0.1   JCTLM
HGB     g/L      g/L       1.0000  0.1   Pass-through
TSH     uIU/mL   mIU/L     1.0000  0.001 Unit-equiv
TSH     mIU/L    mIU/L     1.0000  0.001 Unit-equiv
;
run;
&lt;/pre&gt;

&lt;p&gt;The derivation itself follows a fixed sequence: round LBSTRESN first, then derive LBSTRESC from the rounded value. Never derive LBSTRESC from the raw computation — floating-point residuals will produce character values that do not match LBSTRESN, generating a P21 finding. The same factor and rounding rule that produces LBSTRESN must also produce LBSTNRLO and LBSTNRHI.&lt;/p&gt;

&lt;hr&gt;

&lt;h2&gt;QC checks that matter&lt;/h2&gt;

&lt;p&gt;Standard conformance validation covers the basics — variable presence, type, codelist membership. It does not cover cross-variable consistency, conversion accuracy, or the structural requirement for the LC domain. These checks need to run as part of your standard SDTM QC suite, not as a one-time review.&lt;/p&gt;

&lt;p&gt;The first check is LBSTRESU consistency per LBTESTCD across the full dataset. A single discrepant record passes P21 and fails the analysis. The second is LBSTRESC and LBSTRESN agreement — when LBSTRESN is populated, the numeric portion of LBSTRESC must match it within rounding tolerance. The third is qualitative results with LBSTRESN populated. The fourth is inequality operator preservation — comparing the operator present in LBORRES against what appears in LBSTRESC. The fifth is reference range unit consistency — checking that LBSTNRLO and LBSTNRHI are numerically plausible relative to LBSTRESN and LBSTRESU. The sixth is a factor-based reasonableness check: for each converted record, multiply LBORRES by the expected factor and compare to LBSTRESN. If the difference exceeds the rounding tolerance, the factor is wrong or was applied incorrectly.&lt;/p&gt;

&lt;div class=&quot;warn&quot;&gt;
&lt;strong&gt;P21 does not check:&lt;/strong&gt; LBSTRESU consistency within a LBTESTCD. Conversion accuracy. Reference range unit plausibility. Whether LC is present when FDA clinical submission requires it. These are your responsibility. If your QC plan relies entirely on P21, it is incomplete.
&lt;/div&gt;

&lt;p&gt;Most production LB issues are not single-variable problems. They are mismatches between variables that individually look valid.&lt;/p&gt;
  &lt;p&gt;This is why datasets that pass P21 still fail review.&lt;/p&gt;
&lt;hr&gt;

&lt;h2&gt;Bottom line&lt;/h2&gt;

&lt;p&gt;LB unit conversion fails in the details. A result converts correctly while the reference range does not. A standard unit is assigned directly from the source rather than from controlled metadata, and three records slip through with the wrong LBSTRESU value. A qualitative result gets a numeric encoding for analysis convenience that does not belong in SDTM. An operator disappears during parsing and a boundary value becomes a detected result.&lt;/p&gt;

&lt;p&gt;None of these failures are dramatic. They do not crash anything. They travel silently into Define.xml, into ADaM derivations, into safety tables, and into submission review — where a reviewer with a summary statistics macro finds them in a way that is much harder to explain than if they had been caught in QC.&lt;/p&gt;

&lt;p&gt;The model is simple enough to keep on one line: LBORRES is collected truth, LBSTRES* is the standardized representation, and LBSTNR* follows the same unit system. Every mistake on this list is a version of breaking that contract. Keep the contract. Run the checks. Make the LB/LC structure decision before lock, not at lock.&lt;/p&gt;

&lt;hr&gt;

&lt;div class=&quot;footnote&quot;&gt;
&lt;p&gt;&lt;strong&gt;Sources&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;[1] CDISC SDTMIG v3.4 — LB domain variable definitions: LBORRES, LBORRESU, LBSTRESC, LBSTRESN, LBSTRESU, LBSTNRLO, LBSTNRHI, LBNRIND. LBNRIND documentation requirement in Define.xml. &lt;a href=&quot;https://www.cdisc.org/standards/foundational/sdtmig&quot;&gt;cdisc.org/standards/foundational/sdtmig&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;[2] FDA Study Data Technical Conformance Guide v6.1, December 2025 — LB/LC two-domain requirement for clinical submissions, traceability expectations. &lt;a href=&quot;https://www.fda.gov/media/153632/download&quot;&gt;fda.gov/media/153632/download&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;[3] CDISC Knowledge Base — Standardized Lab Units: FDA expects SI units in LB domain; conventional units in custom LC domain. &lt;a href=&quot;https://www.cdisc.org/kb/articles/standardized-lab-units&quot;&gt;cdisc.org/kb/articles/standardized-lab-units&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;[4] PharmaSUG China 2025 (DS175) — &quot;Bridging FDA and SI Unit Requirements through LB and LC.&quot; Verbatim SDTCG 6.0 mandate and LC domain structure. &lt;a href=&quot;https://www.lexjansen.com/pharmasug-cn/2025/DS/Pharmasug-China-2025-DS175.pdf&quot;&gt;lexjansen.com&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;[5] PMDA Q&amp;amp;A, provisional English translation, March 2025 — SI unit requirement, documentation of conversion equation when original and converted values coexist.&lt;/p&gt;
&lt;p&gt;[6] JCTLM (Joint Committee for Traceability in Laboratory Medicine) — Recommended conversion factors for clinical chemistry analytes. &lt;a href=&quot;https://www.jctlm.org&quot;&gt;jctlm.org&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;[7] CDISC CT, LB Units codelist C71620 — NCI EVS. &lt;a href=&quot;https://evs.nci.nih.gov/ftp1/CDISC/SDTM/&quot;&gt;evs.nci.nih.gov&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;

&lt;div class=&quot;tags&quot;&gt;
&lt;strong&gt;Tags:&lt;/strong&gt;
&lt;span&gt;SDTM&lt;/span&gt;
&lt;span&gt;LB Domain&lt;/span&gt;
&lt;span&gt;LC Domain&lt;/span&gt;
&lt;span&gt;LBORRES&lt;/span&gt;
&lt;span&gt;LBSTRESC&lt;/span&gt;
&lt;span&gt;LBSTRESN&lt;/span&gt;
&lt;span&gt;LBSTRESU&lt;/span&gt;
&lt;span&gt;LBSTNRLO&lt;/span&gt;
&lt;span&gt;LBNRIND&lt;/span&gt;
&lt;span&gt;FDA&lt;/span&gt;
&lt;span&gt;PMDA&lt;/span&gt;
&lt;span&gt;Pinnacle 21&lt;/span&gt;
&lt;span&gt;Define.xml&lt;/span&gt;
&lt;span&gt;SAS&lt;/span&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;/body&gt;
&lt;/html&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2315822260943695633/posts/default/1819660908879287709'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2315822260943695633/posts/default/1819660908879287709'/><link rel='alternate' type='text/html' href='http://studysas.blogspot.com/2026/04/7-mistakes-in-lb-unit-conversion-that.html' title='7 Mistakes in LB Unit Conversion That Still Show Up in SDTM'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-2315822260943695633.post-4172477588518113258</id><published>2026-04-09T16:19:00.003-04:00</published><updated>2026-04-24T07:20:39.163-04:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="LB domain expansion"/><category scheme="http://www.blogger.com/atom/ns#" term="Ophthalmic Examinations Domain"/><category scheme="http://www.blogger.com/atom/ns#" term="Skin Response Domain"/><category scheme="http://www.blogger.com/atom/ns#" term="SUPPDS Domain revised assumption"/><title type='text'>SDTM IG 3.4: SR, OE, LB Domain Expansion, and Revised SUPPDS Assumptions</title><content type='html'>&lt;!DOCTYPE html&gt;
&lt;html lang=&quot;en&quot;&gt;
&lt;head&gt;
&lt;meta charset=&quot;UTF-8&quot;&gt;
&lt;meta name=&quot;viewport&quot; content=&quot;width=device-width, initial-scale=1.0&quot;&gt;
&lt;title&gt;SDTM IG 3.4: SR, OE, LB Expansion, and Revised SUPPDS Assumptions | StudySAS Blog&lt;/title&gt;
&lt;style&gt;
  body {
    font-family: Georgia, &#39;Times New Roman&#39;, serif;
    font-size: 16px;
    line-height: 1.75;
    color: #2c2c2c;
    background: #ffffff;
    margin: 0;
    padding: 0;
  }
  .container {
    max-width: 820px;
    margin: 0 auto;
    padding: 48px 24px 64px;
  }
  h1 {
    font-family: Georgia, serif;
    font-size: 28px;
    color: #1a1a2e;
    line-height: 1.3;
    margin: 0 0 12px 0;
    font-weight: bold;
  }
  h2 {
    font-family: Georgia, serif;
    font-size: 21px;
    color: #1a1a2e;
    margin: 44px 0 14px 0;
    border-bottom: 2px solid #1a1a2e;
    padding-bottom: 6px;
  }
  h3 {
    font-family: Georgia, serif;
    font-size: 17px;
    color: #1a1a2e;
    margin: 28px 0 10px 0;
    font-style: italic;
  }
  .meta {
    font-size: 13px;
    color: #777;
    margin-bottom: 32px;
    font-family: Arial, sans-serif;
  }
  p { margin: 0 0 18px 0; }
  pre {
    background: #f5f5f5;
    border-left: 4px solid #0066cc;
    font-family: &#39;Courier New&#39;, Courier, monospace;
    font-size: 13.5px;
    white-space: pre;
    overflow-x: auto;
    padding: 16px 18px;
    margin: 22px 0;
    line-height: 1.55;
    -webkit-overflow-scrolling: touch;
  }
  code {
    background: #f0f0f0;
    color: #b5432a;
    font-family: &#39;Courier New&#39;, Courier, monospace;
    font-size: 13.5px;
    padding: 2px 5px;
    border-radius: 2px;
  }
  .note {
    background: #fff8e1;
    border-left: 4px solid #f9a825;
    padding: 14px 18px;
    margin: 22px 0;
    font-size: 15px;
  }
  .warn {
    background: #fff3f3;
    border-left: 4px solid #cc2222;
    padding: 14px 18px;
    margin: 22px 0;
    font-size: 15px;
  }
  table {
    width: 100%;
    border-collapse: collapse;
    margin: 24px 0;
    font-size: 14.5px;
  }
  thead tr { background: #1a1a2e; color: #fff; }
  th {
    text-align: left;
    padding: 10px 14px;
    font-family: Arial, sans-serif;
    font-weight: bold;
    font-size: 13px;
    letter-spacing: 0.03em;
  }
  td {
    padding: 9px 14px;
    border-bottom: 1px solid #e0e0e0;
    vertical-align: top;
  }
  tbody tr:nth-child(even) { background: #f7f7f7; }
  .tags {
    margin-top: 52px;
    padding-top: 16px;
    border-top: 1px solid #ddd;
    font-size: 13px;
    color: #666;
    font-family: Arial, sans-serif;
  }
  .sources {
    margin-top: 40px;
    padding: 16px 18px;
    background: #f5f5f5;
    border-left: 4px solid #1a1a2e;
    font-size: 13.5px;
    font-family: Arial, sans-serif;
  }
  .sources p { margin: 0 0 8px 0; line-height: 1.5; }
  .sources a { color: #0066cc; }
  hr.divider {
    border: none;
    border-top: 1px solid #e0e0e0;
    margin: 40px 0;
  }
  .domain-label {
    display: inline-block;
    background: #1a1a2e;
    color: #fff;
    font-family: Arial, sans-serif;
    font-size: 12px;
    font-weight: bold;
    padding: 3px 9px;
    border-radius: 3px;
    letter-spacing: 0.04em;
    margin-bottom: 10px;
  }
  .version-tag {
    display: inline-block;
    background: #e8f0fe;
    color: #0066cc;
    font-family: Arial, sans-serif;
    font-size: 12px;
    padding: 2px 8px;
    border-radius: 3px;
    margin-left: 8px;
    font-weight: bold;
  }
  .correction-tag {
    display: inline-block;
    background: #fff3f3;
    color: #cc2222;
    font-family: Arial, sans-serif;
    font-size: 12px;
    padding: 2px 8px;
    border-radius: 3px;
    margin-left: 8px;
    font-weight: bold;
  }
  sup {
    font-size: 11px;
    color: #0066cc;
  }
&lt;/style&gt;
&lt;/head&gt;
&lt;body&gt;
&lt;div class=&quot;container&quot;&gt;

  &lt;h1&gt;SDTM IG 3.4: SR, OE, LB Expansion, and Revised SUPPDS Assumptions&lt;/h1&gt;

  &lt;p&gt;SDTM IG 3.4 was published by CDISC in July 2022 alongside SDTM v2.0.&lt;sup&gt;[1]&lt;/sup&gt; This post focuses on what actually changes programming and submission behavior — written for SDTM programmers and submission leads. If you have been working from 3.2 or 3.3, several of these changes will require updates to domain routing logic, controlled terminology mappings, Define.xml metadata, and SUPPDS population rules that have been stable in company standards for years. This post examines four areas where working programmers face concrete re-mapping decisions: the SR domain and how the IS scope expansion in 3.4 redraws its boundaries, the OE domain as the post-MO home for ophthalmic data, the restructured LB with confirmed new variables and a substantially narrowed scope, and the SUPPDS assumptions v3.4 quietly rewrote.&lt;/p&gt;

  &lt;p&gt;One terminology correction before we start. Several posts and internal notes reference an &quot;ON&quot; domain in the context of SDTM IG 3.4. There is no domain with the two-letter code ON in any version of the SDTMIG. The ophthalmic domain code is &lt;strong&gt;OE&lt;/strong&gt; — Ophthalmic Examinations — introduced in v3.3. Section 9 of v3.4 introduced OI (Non-Host Organism Identifiers) as a Study Reference dataset, not a clinical observations domain. This post uses OE throughout.&lt;/p&gt;

  &lt;hr class=&quot;divider&quot;&gt;

  &lt;h2&gt;SR — Skin Response Domain&lt;/h2&gt;

  &lt;div class=&quot;domain-label&quot;&gt;SR&lt;/div&gt;
  &lt;span class=&quot;version-tag&quot;&gt;Introduced v3.2&lt;/span&gt;
  &lt;span class=&quot;version-tag&quot;&gt;Routing boundaries defined v3.4&lt;/span&gt;

  &lt;p&gt;SR was introduced in SDTM IG 3.2 as one of eleven new domains in that release.&lt;sup&gt;[2]&lt;/sup&gt; It is a Findings Sub-Class domain — not a general Findings domain and not a Findings About domain.&lt;sup&gt;[3]&lt;/sup&gt; One record per test per visit per subject. It captures dermal responses to antigens typically assessed from skin-prick or intradermal challenge: wheal diameter, erythema diameter, induration size. The structure did not change in v3.4. What changed is the precision of the boundary condition that determines whether data belongs in SR at all.&lt;/p&gt;

  &lt;h3&gt;The Three-Way Routing Decision Formalized in v3.4&lt;/h3&gt;

  &lt;p&gt;The IS scope expansion in v3.4 is the driver here. Under v3.2 and v3.3, the IS domain was scoped narrowly to data from assessments describing whether a study therapy provoked an immune response. Specimen-based immune testing in other contexts ended up in LB or MB depending on version. In v3.4, IS is redefined to cover all specimen-based assessments that measure the presence, magnitude, and scale of an immune response upon any antigen stimulation or encounter — not restricted to study therapy.&lt;sup&gt;[4]&lt;/sup&gt;&lt;/p&gt;

  &lt;p&gt;That expansion creates a three-way routing that v3.4 defines explicitly for allergy and vaccine programs.&lt;sup&gt;[4]&lt;/sup&gt; First split: is the immune response specimen-based or surface-based? Specimen-based data — anything measured from a collected sample: serum IgE levels, antibody titers from blood or plasma, cellular immune assays from PBMCs — goes to IS. Surface-based responses split further on intent. A wanted, expected localized dermal response to a substance administered to provoke that response goes to SR. A tuberculin PPD skin test wheal is SR. An allergen skin prick test wheal-and-flare is SR. An unwanted, symptomatic allergic reaction — injection-site erythema coded as an adverse event, vaccine reactogenicity — goes to AE or CE.&lt;/p&gt;

  &lt;div class=&quot;note&quot;&gt;
    &lt;strong&gt;Routing decision for dermal and immune data under v3.4&lt;/strong&gt; [Source: CDISC KB Article, IS Domain Scope Update]&lt;br&gt;&lt;br&gt;
    Specimen-based immune response (serum IgE, antibody titers, ELISPOT, neutralization assays) → &lt;strong&gt;IS&lt;/strong&gt;&lt;br&gt;
    Localized surface response, wanted and expected (PPD induration, allergen prick wheal) → &lt;strong&gt;SR&lt;/strong&gt;&lt;br&gt;
    Localized surface response, unwanted or symptomatic → &lt;strong&gt;AE&lt;/strong&gt; or &lt;strong&gt;CE&lt;/strong&gt;
  &lt;/div&gt;

  &lt;h3&gt;SR Domain Structure&lt;/h3&gt;

  &lt;p&gt;The variables that programmers most often underuse or misuse in SR: &lt;code&gt;SRANTREG&lt;/code&gt; captures the anatomical region of the test application site — not the general body area. &lt;code&gt;SRLAT&lt;/code&gt; (laterality) must be populated for bilateral comparisons. &lt;code&gt;SRCAT&lt;/code&gt; captures test category (ALLERGEN PANEL, TUBERCULIN) and &lt;code&gt;SRSCAT&lt;/code&gt; the subcategory. When a protocol tests multiple antigen concentrations on the same visit, each concentration generates a separate record. The structure below shows a standard bilateral allergy panel with one NOT DONE record handled correctly:&lt;/p&gt;

  &lt;pre&gt;/* SR — Bilateral allergen panel, one NOT DONE record */

STUDYID  DOMAIN  USUBJID         SRSEQ  SRTESTCD  SRTEST             SRORRES  SRSTRESC  SRSTRESN  SRSTRESU
-------- ------  --------------- -----  --------  -----------------  -------  --------  --------  --------
ABC-001  SR      ABC-001-001-01  1      WHEAL     Wheal Diameter     12       12        12        mm
ABC-001  SR      ABC-001-001-01  2      ERYTHEMA  Erythema Diameter  25       25        25        mm
ABC-001  SR      ABC-001-001-01  3      WHEAL     Wheal Diameter     (null)   (null)    (null)    (null)

         SRSTAT    SRREASND         SRANTREG       SRLAT   SRDTC        SRDY  VISIT
         --------  ---------------  -------------  ------  -----------  ----  ------
         (null)    (null)           LEFT FOREARM   LEFT    2023-03-15   8     WEEK 4
         (null)    (null)           LEFT FOREARM   LEFT    2023-03-15   8     WEEK 4
         NOT DONE  SUBJECT REFUSED  RIGHT FOREARM  RIGHT   2023-03-15   8     WEEK 4&lt;/pre&gt;

  &lt;h3&gt;What v3.4 Changed for SR in Practice&lt;/h3&gt;

  &lt;p&gt;If your program was routing anti-allergen IgE antibody titers into LB — which was correct under v3.2 and v3.3 for pre-exposure baseline data — those records move to IS under v3.4.&lt;sup&gt;[4]&lt;/sup&gt; SR itself is structurally unchanged, but its boundary with IS is now formally defined in the IG rather than left to interpretation. FDA reviewers are expected to confirm appropriate domain selection when allergy or vaccine data is present in a submission, so your Reviewer&#39;s Guide should document the routing decision explicitly.&lt;/p&gt;

  &lt;div class=&quot;warn&quot;&gt;
    &lt;strong&gt;Baseline split problem:&lt;/strong&gt; Under v3.2/v3.3, a common pattern was to put pre-treatment baseline antibody measurements in LB (or MB under 3.3) and post-exposure measurements in IS, because IS was restricted to therapy-induced responses. This creates a baseline/post split across domains that makes ADaM baseline flagging and change-from-baseline derivation fragile. Under v3.4, all antigen-induced antibody data — including pre-exposure baseline if the antigen in question is the study treatment or allergen — belongs in IS. Baseline records should be anchored by VISITNUM and EPOCH context within IS, not by a separate domain. Programs transitioning from 3.3 to 3.4 must document this remapping — where legacy baseline records lived and where they now reside — explicitly in their Reviewer&#39;s Guide.
  &lt;/div&gt;

  &lt;hr class=&quot;divider&quot;&gt;

  &lt;h2&gt;OE — Ophthalmic Examinations Domain&lt;/h2&gt;

  &lt;div class=&quot;domain-label&quot;&gt;OE&lt;/div&gt;
  &lt;span class=&quot;version-tag&quot;&gt;Introduced v3.3&lt;/span&gt;
  &lt;span class=&quot;version-tag&quot;&gt;MO decommissioned v3.4&lt;/span&gt;
  &lt;span class=&quot;correction-tag&quot;&gt;Code is OE, not ON&lt;/span&gt;

  &lt;p&gt;OE was introduced in SDTM IG 3.3 as one of several body-system Findings domains developed through TAUG work.&lt;sup&gt;[5]&lt;/sup&gt; It was not new in v3.4. What changed in v3.4 is that the MO (Morphology) domain was formally decommissioned, and the IG explicitly directs sponsors to use body-system domains — OE among them — rather than MO for morphological findings data.&lt;sup&gt;[1]&lt;/sup&gt; For ophthalmology programs, this means OE is now the only compliant home for both functional and morphological eye data under v3.4.&lt;/p&gt;

  &lt;h3&gt;What Goes in OE&lt;/h3&gt;

  &lt;p&gt;OE captures structured findings from ophthalmic assessments: visual acuity by Snellen chart or ETDRS letter score, intraocular pressure by Goldmann applanation tonometry or non-contact methods, slit-lamp biomicroscopy findings, fundoscopy results, optical coherence tomography retinal thickness measurements, visual field parameters, color vision test results, and corneal topography. Quantitative and categorical results from any structured ophthalmic examination go in OE.&lt;/p&gt;

  &lt;h3&gt;The Laterality Requirement&lt;/h3&gt;

  &lt;p&gt;This is the most consistent error in OE datasets at submission. &lt;code&gt;OELAT&lt;/code&gt; is Expected, not Permissible. Every record must document whether the finding applies to the LEFT eye, RIGHT eye, or BILATERAL. When an assessment is bilateral and results are collected per eye, two records are required — one per laterality value. Reviewers use &lt;code&gt;OELAT&lt;/code&gt; to reconstruct the patient-eye-visit trajectory across timepoints. Automated review tools cannot do this when &lt;code&gt;OELAT&lt;/code&gt; is missing or inconsistently populated. The example below shows a correctly structured bilateral baseline assessment:&lt;/p&gt;

  &lt;pre&gt;/* OE — Bilateral baseline: visual acuity and IOP, one eye per record */

STUDYID  DOMAIN  USUBJID         OESEQ  OETESTCD  OETEST                 OELAT   OELOC
-------- ------  --------------- -----  --------  ---------------------  ------  -----
XYZ-002  OE      XYZ-002-005-03  1      VATOTSC   Visual Acuity Total    LEFT    EYE
XYZ-002  OE      XYZ-002-005-03  2      VATOTSC   Visual Acuity Total    RIGHT   EYE
XYZ-002  OE      XYZ-002-005-03  3      IOP       Intraocular Pressure   LEFT    EYE
XYZ-002  OE      XYZ-002-005-03  4      IOP       Intraocular Pressure   RIGHT   EYE

         OEORRES  OESTRESC  OESTRESN  OESTRESU  OEMETHOD  OEDTC        VISIT
         -------  --------  --------  --------  --------  -----------  --------
         72       72        72        letters   ETDRS     2023-06-01   BASELINE
         68       68        68        letters   ETDRS     2023-06-01   BASELINE
         14       14        14        mmHg      GAT       2023-06-01   BASELINE
         16       16        16        mmHg      GAT       2023-06-01   BASELINE&lt;/pre&gt;

  &lt;p&gt;&lt;code&gt;OELOC&lt;/code&gt; captures the anatomical location within the eye — CORNEA, RETINA, MACULA, OPTIC DISC — and becomes essential when multiple findings from different eye structures are collected at a single visit. &lt;code&gt;OEMETHOD&lt;/code&gt; distinguishes assessment methods, which matters for IOP where Goldmann applanation, non-contact tonometry, and iCare rebound produce systematically different values with different reference ranges.&lt;/p&gt;

  &lt;h3&gt;ETDRS vs Snellen and the STRESC Problem&lt;/h3&gt;

  &lt;p&gt;Ophthalmic trials frequently collect Snellen notation on the CRF while the protocol specifies ETDRS letter equivalents as the analysis variable. These are not interchangeable. &lt;code&gt;OEORRES&lt;/code&gt; should capture what the CRF collected — the Snellen fraction if that is what was entered. &lt;code&gt;OESTRESC&lt;/code&gt; and &lt;code&gt;OESTRESN&lt;/code&gt; hold the standardized numeric result. If the CRF collected 20/40 Snellen and you are putting 50 in &lt;code&gt;OESTRESN&lt;/code&gt;, that conversion must be documented in the Define.xml Comments column and explained in the Reviewer&#39;s Guide, including which reference conversion chart was used. &lt;code&gt;OETESTCD&lt;/code&gt; must also distinguish method: VATOTSC with &lt;code&gt;OEMETHOD = ETDRS&lt;/code&gt; is a different concept from VATOTSC with &lt;code&gt;OEMETHOD = SNELLEN&lt;/code&gt;. These should not share a single TESTCD if their results are not interchangeable for analysis.&lt;/p&gt;

  &lt;h3&gt;MO Decommissioning and Backward Compatibility&lt;/h3&gt;

  &lt;p&gt;The IG recommends that sponsors submitting under v3.4 use body-system domains and leave MO behind entirely — even if earlier studies in the same program mapped to MO.&lt;sup&gt;[1]&lt;/sup&gt; For programs with ophthalmology data across multiple studies where some were submitted under v3.3 (OE) and earlier studies used MO, the Reviewer&#39;s Guide must document this version-based domain difference. The ADaM programmer needs to know the source domain for each study when tracing derivations back to tabulation data.&lt;/p&gt;

  &lt;div class=&quot;note&quot;&gt;
    &lt;strong&gt;Note on OE in v3.4:&lt;/strong&gt; OE has no new variables added in v3.4 itself. It entered the SDTMIG through v3.3. What v3.4 changed is the formal decommissioning of MO, removing the alternative that some sponsors were still using for morphological findings. Under v3.4, OE is the only compliant domain for ophthalmic observation data.
  &lt;/div&gt;

  &lt;hr class=&quot;divider&quot;&gt;

  &lt;h2&gt;LB — Laboratory Test Results: Confirmed New Variables and Narrowed Scope&lt;/h2&gt;

  &lt;div class=&quot;domain-label&quot;&gt;LB&lt;/div&gt;
  &lt;span class=&quot;version-tag&quot;&gt;New variables in v3.4&lt;/span&gt;
  &lt;span class=&quot;version-tag&quot;&gt;~400 CT terms migrate to IS&lt;/span&gt;

  &lt;p&gt;LB in 3.4 changed in two directions at once: it gained new variables designed to decompose complex assay metadata, and it lost a large category of data to IS. The CDISC IG documentation states that LB was updated to include 10 new variables in v3.4.&lt;sup&gt;[1]&lt;/sup&gt; The complete list requires CDISC Library access. The variables below are confirmed from publicly available CDISC sources.&lt;sup&gt;[1][4][6]&lt;/sup&gt;&lt;/p&gt;

  &lt;h3&gt;Confirmed New Variables in LB for v3.4&lt;/h3&gt;

  &lt;table&gt;
    &lt;thead&gt;
      &lt;tr&gt;
        &lt;th&gt;Root Variable&lt;/th&gt;
        &lt;th&gt;LB-Prefixed Name&lt;/th&gt;
        &lt;th&gt;Label&lt;/th&gt;
        &lt;th&gt;Role&lt;/th&gt;
        &lt;th&gt;Shared With&lt;/th&gt;
        &lt;th&gt;Primary Purpose&lt;/th&gt;
      &lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody&gt;
      &lt;tr&gt;
        &lt;td&gt;&lt;code&gt;TSTCND&lt;/code&gt;&lt;/td&gt;
        &lt;td&gt;&lt;code&gt;LBTSTCND&lt;/code&gt;&lt;/td&gt;
        &lt;td&gt;Test Condition&lt;/td&gt;
        &lt;td&gt;Variable Qualifier&lt;/td&gt;
        &lt;td&gt;IS, CP&lt;/td&gt;
        &lt;td&gt;Stimulus condition under which the test was performed (e.g., EX VIVO STIMULATION, UNSTIMULATED)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;&lt;code&gt;CNDAGT&lt;/code&gt;&lt;/td&gt;
        &lt;td&gt;&lt;code&gt;LBCNDAGT&lt;/code&gt;&lt;/td&gt;
        &lt;td&gt;Test Condition Agent&lt;/td&gt;
        &lt;td&gt;Variable Qualifier&lt;/td&gt;
        &lt;td&gt;IS, CP&lt;/td&gt;
        &lt;td&gt;Specific agent used to create the test condition (e.g., LPS, PHA-M, PHYTOHAEMAGGLUTININ)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;&lt;code&gt;BDAGNT&lt;/code&gt;&lt;/td&gt;
        &lt;td&gt;&lt;code&gt;LBBDAGNT&lt;/code&gt;&lt;/td&gt;
        &lt;td&gt;Binding Agent&lt;/td&gt;
        &lt;td&gt;Variable Qualifier&lt;/td&gt;
        &lt;td&gt;IS, CP&lt;/td&gt;
        &lt;td&gt;Binding target or detection agent (antibody, probe) used in the assay — separates analyte from detection method&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;&lt;code&gt;TSTOPO&lt;/code&gt;&lt;/td&gt;
        &lt;td&gt;&lt;code&gt;LBTSTOPO&lt;/code&gt;&lt;/td&gt;
        &lt;td&gt;Test Operational Objective&lt;/td&gt;
        &lt;td&gt;Variable Qualifier&lt;/td&gt;
        &lt;td&gt;IS&lt;/td&gt;
        &lt;td&gt;What the test is operationally designed to accomplish: DETECTION, QUANTIFICATION, CHARACTERIZATION&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;&lt;code&gt;TSTDTL&lt;/code&gt;&lt;/td&gt;
        &lt;td&gt;&lt;code&gt;LBTSTDTL&lt;/code&gt;&lt;/td&gt;
        &lt;td&gt;Test Detail&lt;/td&gt;
        &lt;td&gt;Variable Qualifier&lt;/td&gt;
        &lt;td&gt;IS&lt;/td&gt;
        &lt;td&gt;Additional granularity beyond --TEST — distinguishes assay variants that share a TESTCD (e.g., NT50, NT80, PRNT50)&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;&lt;code&gt;LBCOLSRT&lt;/code&gt;&lt;/td&gt;
        &lt;td&gt;&lt;em&gt;domain-specific&lt;/em&gt;&lt;/td&gt;
        &lt;td&gt;Collection Sort Order&lt;/td&gt;
        &lt;td&gt;Variable Qualifier*&lt;/td&gt;
        &lt;td&gt;LB only&lt;/td&gt;
        &lt;td&gt;Ordering of collection records when multiple tubes are taken at a single timepoint; role corrected by errata&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;&lt;code&gt;LBLOINC&lt;/code&gt;&lt;/td&gt;
        &lt;td&gt;&lt;em&gt;domain-specific&lt;/em&gt;&lt;/td&gt;
        &lt;td&gt;LOINC Code&lt;/td&gt;
        &lt;td&gt;Record Qualifier†&lt;/td&gt;
        &lt;td&gt;CP, MB, MS (parallel)&lt;/td&gt;
        &lt;td&gt;Formal LOINC mapping for the test — role corrected by errata from Synonym Qualifier to Record Qualifier&lt;/td&gt;
      &lt;/tr&gt;
    &lt;/tbody&gt;
  &lt;/table&gt;

  &lt;p style=&quot;font-size:13.5px; color:#555;&quot;&gt;* &lt;code&gt;LBCOLSRT&lt;/code&gt; was published with role &quot;Record Qualifier&quot; and corrected by errata to &quot;Variable Qualifier.&quot;&lt;sup&gt;[1-errata]&lt;/sup&gt; &amp;nbsp;&amp;nbsp; † &lt;code&gt;LBLOINC&lt;/code&gt; was published with role &quot;Synonym Qualifier&quot; and corrected by errata to &quot;Record Qualifier.&quot;&lt;sup&gt;[1-errata]&lt;/sup&gt;&lt;/p&gt;

  &lt;h3&gt;Why These Variables Were Added: The TESTCD Overloading Problem&lt;/h3&gt;

  &lt;p&gt;Under v3.2 and v3.3, a complex immunogenicity assay had to compress all meaningful metadata into &lt;code&gt;LBTESTCD&lt;/code&gt; (max 8 characters) and &lt;code&gt;LBTEST&lt;/code&gt; (max 40 characters) — resulting in NCI C-codes substituted for readable TESTCDs, truncated TEST values, and supplemental qualifiers created to carry context that had no standard home.&lt;sup&gt;[6]&lt;/sup&gt; v3.4 post-coordinates this into structured variables. The same test now looks like this:&lt;/p&gt;

  &lt;pre&gt;/* v3.2/3.3 — pre-coordinated, truncated, supplemental qualifier required */
LBTESTCD = NRSVIGG            /* compressed mnemonic because human-readable name exceeds 8 chars */
LBTEST   = Neut. Respirat. Syncytial Virus IgG NT50  /* truncated to fit 40-char limit */
/* SUPPLB: QNAM = &quot;NEUTTITR&quot;, QLABEL = &quot;Neutralization Titer Detail&quot;, QVAL = &quot;50% NEUTRALIZATION TITER&quot; */

/* v3.4 — post-coordinated across structured variables, IS domain */
ISTESTCD = MBIGGNAB           /* Neutralizing Microbial-induced IgG Antibody */
ISTEST   = Neutralizing Microbial-induced IgG Antibody
ISBDAGNT = RESPIRATORY SYNCYTIAL VIRUS
ISTSTDTL = 50% NEUTRALIZATION TITER
/* No supplemental qualifiers needed */&lt;/pre&gt;

  &lt;p&gt;The TESTCD value remains human-readable under v3.4. Context that was pre-coordinated into compound test names now has its own standard variable home, and SUPP-- qualifiers created to carry BINDING AGENT, CONDITION, or ASSAY VARIANT information can be retired.&lt;/p&gt;

  &lt;h3&gt;The LBLOINC Errata: Define.xml Impact&lt;/h3&gt;

  &lt;p&gt;This errata entry has a direct consequence in Define.xml that is easy to miss. &lt;code&gt;LBLOINC&lt;/code&gt; was published in the original v3.4 release with the role of Synonym Qualifier.&lt;sup&gt;[1-errata]&lt;/sup&gt; The errata corrects it to Record Qualifier. In Define.xml, Synonym Qualifier variables are treated as alternate labels for the topic variable — they do not carry independent origin, method, or codelist metadata. Record Qualifiers do. If your Define.xml for LB was templated against the originally published role assignment, the metadata for LBLOINC is wrong. FDA review tools validate against the model metadata. Fix it before submission. The same errata applies to &lt;code&gt;CPLOINC&lt;/code&gt;, &lt;code&gt;MBLOINC&lt;/code&gt;, and &lt;code&gt;MSLOINC&lt;/code&gt; — all had the same role error and the same correction.&lt;/p&gt;

  &lt;h3&gt;The Scope Contraction: ~400 CT Terms Leave LB for IS&lt;/h3&gt;

  &lt;p&gt;This is the part that breaks existing lab standards programs. Under v3.4, LB no longer contains most immune response assessments or non-host organism tests.&lt;sup&gt;[6]&lt;/sup&gt; Approximately 400 antibody-related TESTCD and TEST controlled terminology values are deprecated from LB (and MB) and remodeled in IS using IS domain standard variables including &lt;code&gt;ISTESTCD&lt;/code&gt;, &lt;code&gt;ISBDAGNT&lt;/code&gt;, and &lt;code&gt;ISTSTDTL&lt;/code&gt;.&lt;sup&gt;[6]&lt;/sup&gt; CDISC publishes a mapping file updated quarterly to help sponsors map deprecated concepts to their new post-coordinated IS equivalents.&lt;/p&gt;

  &lt;p&gt;What LB retains: clinical chemistry, hematology, urinalysis, coagulation, PK-supporting bioassays, and autoantibodies driven by pre-existing conditions not related to antigen stimulation. The boundary test is: is this measuring an immune response to an antigen? If yes, IS. If it is a general biochemical or hematological measurement, LB.&lt;/p&gt;

  &lt;div class=&quot;warn&quot;&gt;
    &lt;strong&gt;SAS standards impact:&lt;/strong&gt; Any master LBTESTCD library that was built against v3.2 or v3.3 controlled terminology includes test codes that no longer belong in LB under v3.4. If your routing macro assigns domain based on LBTESTCD match against that library, it will continue to put immune response data in lb.xpt incorrectly. The library needs to be audited against the CDISC deprecation mapping file. Test codes flagged for migration need to be removed from the LB library and remapped to IS controlled terminology. Any SUPP-- qualifiers that were created to carry binding agent or test condition context for those records can be retired if the data moves to IS standard variables.
  &lt;/div&gt;

  &lt;hr class=&quot;divider&quot;&gt;

  &lt;h2&gt;SUPPDS — Three Revised Assumptions in v3.4&lt;/h2&gt;

  &lt;div class=&quot;domain-label&quot;&gt;SUPPDS&lt;/div&gt;
  &lt;span class=&quot;version-tag&quot;&gt;Assumptions revised v3.4&lt;/span&gt;

  &lt;p&gt;Three changes in v3.4 directly affect what goes in SUPPDS and what does not. Two of them prohibit things that were common practice under v3.2 and v3.3. One removes a restriction that was forcing inaccurate DS modeling.&lt;/p&gt;

  &lt;h3&gt;Population Flags Are No Longer SDTM Data&lt;/h3&gt;

  &lt;p&gt;This is stated explicitly in v3.4 DM domain assumptions: population flags — &lt;code&gt;COMPLT&lt;/code&gt;, &lt;code&gt;FULLSET&lt;/code&gt;, &lt;code&gt;ITT&lt;/code&gt;, &lt;code&gt;PPROT&lt;/code&gt;, and &lt;code&gt;SAFETY&lt;/code&gt; — should not be included in SDTM data.&lt;sup&gt;[7]&lt;/sup&gt; Under v3.2 and v3.3 guidance, many sponsors included these in SUPPDM, on the reasoning that they were subject-level qualifiers derivable from disposition and protocol data. v3.4 closes that. These values are ADaM population derivation outputs. When they appear in both SDTM and ADaM, reviewers cannot establish the direction of derivation — whether the flag drove the analysis or was derived from it. The circular traceability dependency is the problem. If your SUPPDM template includes population flags, that template needs to change for any v3.4 submission.&lt;/p&gt;

  &lt;h3&gt;EPOCH Restriction for PROTOCOL MILESTONE Records Removed&lt;/h3&gt;

  &lt;p&gt;Under earlier guidance, the DS domain assumption was that EPOCH should not be populated when &lt;code&gt;DSCAT = &quot;PROTOCOL MILESTONE&quot;&lt;/code&gt;.&lt;sup&gt;[7]&lt;/sup&gt; The logic was that milestones like Informed Consent and Screen Failure are not tied to a treatment epoch. In practice this prevented accurate representation of programs with multiple consent events — re-consent procedures, optional substudy consent, adaptive design consent updates — where the EPOCH context was needed to distinguish which record represented which consent event within which trial phase. v3.4 removes this restriction. EPOCH may now be populated for PROTOCOL MILESTONE records. If you use this capability, your SE and TV domains must define and include the EPOCH values you reference in DS.&lt;/p&gt;

  &lt;h3&gt;DSSCAT: Formalizing the Treatment vs Participation Split&lt;/h3&gt;

  &lt;p&gt;v3.3 formalized the use of &lt;code&gt;DSSCAT&lt;/code&gt; to distinguish study treatment disposition from study participation disposition. v3.4 reinforces this structure and clarifies the SUPPDS boundary that results from it.&lt;sup&gt;[7]&lt;/sup&gt; The correct architecture is: &lt;code&gt;DSCAT&lt;/code&gt; holds the high-level category (DISPOSITION EVENT, PROTOCOL MILESTONE). &lt;code&gt;DSSCAT&lt;/code&gt; distinguishes study treatment disposition from study participation disposition within each DSCAT category. When a subject discontinues treatment but continues in follow-up, that is two DS records — one with &lt;code&gt;DSSCAT = &quot;STUDY TREATMENT DISPOSITION&quot;&lt;/code&gt; and one with &lt;code&gt;DSSCAT = &quot;STUDY PARTICIPATION DISPOSITION&quot;&lt;/code&gt;.&lt;/p&gt;

  &lt;p&gt;SUPPDS is appropriate when additional context cannot be captured in standard DS variables — a free-text withdrawal reason when DSDECOD controlled terminology is insufficient, or a sponsor-specific sub-reason below the level of standard coding. It is not appropriate for data that could go in DSCAT, DSSCAT, or DSDECOD with the correct controlled terminology applied. The more common audit finding is sponsors using SUPPDS for data that belongs in DSSCAT or as a DSDECOD value.&lt;/p&gt;

  &lt;div class=&quot;note&quot;&gt;
    &lt;strong&gt;Define.xml VLM implication:&lt;/strong&gt; If DSSCAT splits treatment and participation disposition, the Value Level Metadata for DSDECOD should reflect different CDISC controlled terminology codelists for each DSSCAT value. The DS controlled terminology codelist has separate value sets for study treatment outcomes and study participation outcomes. Applying the general DSDECOD codelist without VLM differentiation by DSSCAT is a common P21 finding in Define.xml review. Build the VLM correctly from the start — do not treat DSDECOD as having a single codelist applicable to all DS records.
  &lt;/div&gt;

  &lt;h3&gt;What Still Belongs in SUPPDS&lt;/h3&gt;

  &lt;p&gt;With population flags prohibited and DSSCAT formalized, the legitimate use cases for SUPPDS narrow considerably. SUPPDS is appropriate for: protocol deviation sub-reason text that does not map to standard DSDECOD values, sponsor-specific categorization of withdrawal reason at a level below CDISC controlled terminology, and date-level detail for milestones where a second date variable is needed and no standard variable exists. It is not appropriate for data that has a standard DS variable — including &lt;code&gt;DSSCAT&lt;/code&gt;, &lt;code&gt;EPOCH&lt;/code&gt;, and &lt;code&gt;DSCAT&lt;/code&gt; — even if those variables have not been used historically in your company standards.&lt;/p&gt;

  &lt;hr class=&quot;divider&quot;&gt;

  &lt;h2&gt;Transition Checklist for v3.4&lt;/h2&gt;

  &lt;p&gt;For each of the areas covered, the practical re-mapping work is specific and auditable.&lt;/p&gt;

  &lt;p&gt;For SR: audit any allergy or vaccine program where antibody titer data was routed to LB at baseline and IS post-exposure. Under v3.4, antigen-induced antibody records should all be in IS regardless of collection timing relative to study product exposure. Update or retire any SUPP-- qualifiers that were carrying immune response context that now has IS standard variable homes. Document the routing decision in the Reviewer&#39;s Guide explicitly.&lt;/p&gt;

  &lt;p&gt;For OE: if any prior studies in the same program mapped ophthalmic data to MO, document the version-based domain difference in the Reviewer&#39;s Guide. Audit OELAT population across all OE datasets — Expected variables with missing values are a P21 finding. Confirm that OETESTCD values distinguish assessment method when different methods produce non-interchangeable results. Verify that any Snellen-to-ETDRS conversions are documented in Define.xml Comments and the Reviewer&#39;s Guide.&lt;/p&gt;

  &lt;p&gt;For LB: run your master LBTESTCD library against the CDISC deprecation mapping file to identify the test codes that must migrate to IS. Remove those codes from LB routing logic and add them to IS controlled terminology mappings. Fix the LBLOINC role in your Define.xml metadata template from Synonym Qualifier to Record Qualifier. Also apply the same fix for CPLOINC, MBLOINC, and MSLOINC if those domains are in your submission. Evaluate the confirmed new variables — LBTSTCND, LBCNDAGT, LBBDAGNT, LBTSTOPO, LBTSTDTL — against your therapeutic area&#39;s existing SUPP-- qualifiers to determine whether any SUPPQUAL content can be retired in favor of standard variables.&lt;/p&gt;

  &lt;p&gt;For SUPPDS: remove population flags from any SUPPDM or SUPPDS template. Update VLM for DSDECOD to reflect separate controlled terminology codelists by DSSCAT value. Confirm that EPOCH is populated for PROTOCOL MILESTONE records where epoch context is meaningful for the study design. Audit SUPPDS content to confirm that each QNAM cannot be mapped to a standard DS variable under v3.4 assumptions.&lt;/p&gt;

  &lt;p&gt;None of this is optional for studies submitted under v3.4. FDA&#39;s Data Standards Catalog lists v3.4 as the current supported version. P21 Enterprise validation runs against v3.4 conformance rules. The time to update the standards is before the study starts, not during submission preparation. If your standards still reflect 3.2 or 3.3 assumptions, these are the areas that will surface first in review.&lt;/p&gt;

  &lt;hr class=&quot;divider&quot;&gt;

  &lt;div class=&quot;sources&quot;&gt;
    &lt;p&gt;&lt;strong&gt;Sources and Verification&lt;/strong&gt;&lt;/p&gt;
    &lt;p&gt;[1] CDISC. &lt;em&gt;Study Data Tabulation Model Implementation Guide: Human Clinical Trials v3.4 (Final)&lt;/em&gt;. July 2022. &lt;a href=&quot;https://www.cdisc.org/standards/foundational/sdtmig/sdtmig-v3-4&quot; target=&quot;_blank&quot;&gt;cdisc.org/standards/foundational/sdtmig/sdtmig-v3-4&lt;/a&gt; — Primary source for all v3.4 domain specifications, errata (LBLOINC role, LBCOLSRT role, CPLOINC/MBLOINC/MSLOINC role corrections), and SUPPDS assumption changes.&lt;/p&gt;
    &lt;p&gt;[2] CDISC. &lt;em&gt;SDTMIG v3.2&lt;/em&gt;. 2013. &lt;a href=&quot;https://www.cdisc.org/standards/foundational/sdtmig/sdtmig-v3-2&quot; target=&quot;_blank&quot;&gt;cdisc.org/standards/foundational/sdtmig/sdtmig-v3-2&lt;/a&gt; — Source confirming SR introduced in v3.2 (with EC, PR, HO, DD, IS, MI, MO, RP, SS, TD); SR errata confirming Findings Sub-Class classification.&lt;/p&gt;
    &lt;p&gt;[3] PharmaSUG 2016, Paper DS04. Wittle et al. &lt;em&gt;Moving up! — SDTM v3.2 — What is new and how to use it&lt;/em&gt;. — Confirms SR structure: one record per test per visit per subject; dermal responses to antigens from skin-prick assessments.&lt;/p&gt;
    &lt;p&gt;[4] CDISC Knowledge Base. &lt;em&gt;IS Domain Scope Update for the SDTMIG v3.4: A Development History and the Difficulties of Standardizing Complicated Biological Processes&lt;/em&gt;. &lt;a href=&quot;https://www.cdisc.org/kb/articles/domain-scope-update-sdtmig-v3-4-development-history-and-difficulties-standardizing&quot; target=&quot;_blank&quot;&gt;cdisc.org/kb/articles&lt;/a&gt; — Source for three-way routing decision (SR vs IS vs AE/CE), IS scope redefinition, and antigen definition in v3.4.&lt;/p&gt;
    &lt;p&gt;[5] CDISC 2024 China Interchange. Fan Yang. &lt;em&gt;An In-Depth Analysis of the Updates and Challenges in SDTM IG 3.3 and 3.4&lt;/em&gt;. &lt;a href=&quot;https://www.cdisc.org/sites/default/files/2024-09/2024_CDISC_An%20In-Depth%20Analysis%20of%20the%20Updates%20and%20Challenges%20in%20SDTM%20IG%203.3%20and%203.4%20_Fan%20Yang.pdf&quot; target=&quot;_blank&quot;&gt;cdisc.org&lt;/a&gt; — Confirms OE (Ophthalmic Examinations) introduced in v3.3, not v3.4; MO decommissioned in v3.4.&lt;/p&gt;
    &lt;p&gt;[6] CDISC Webinar. Li et al. &lt;em&gt;LB, MB &amp;amp; IS Domain Scope Changes for the SDTMIG v3.4 and Impact on Controlled Terminology&lt;/em&gt;. June 2023. &lt;a href=&quot;https://www.cdisc.org/sites/default/files/pdf/Education%20Webinar_LB-MB-IS%20Scope%20Change%20and%20CT%20Imapact_2023-06-22_updated.pdf&quot; target=&quot;_blank&quot;&gt;cdisc.org (PDF)&lt;/a&gt; — Primary source for: confirmed new LB/IS/CP variables (BDAGNT, TSTCND, CNDAGT, TSTOPO, TSTDTL), TESTCD overloading rationale, ~400 CT term deprecations from LB/MB to IS, LB scope definition under v3.4.&lt;/p&gt;
    &lt;p&gt;[7] PHUSE-US 2024. Bheemagani et al. &lt;em&gt;What&#39;s New in the SDTMIG v3.4 and the SDTM v2.0&lt;/em&gt;. &lt;a href=&quot;https://www.lexjansen.com/phuse-us/2024/ds/PAP_DS10.pdf&quot; target=&quot;_blank&quot;&gt;lexjansen.com&lt;/a&gt; — Source for: population flags prohibition in SDTM (DM domain assumption), EPOCH restriction removal for PROTOCOL MILESTONE records, DSSCAT treatment vs participation split.&lt;/p&gt;
  &lt;/div&gt;

  &lt;div class=&quot;tags&quot;&gt;
    Tags: SDTM IG 3.4, SR domain, OE domain, LB domain, SUPPDS, CDISC, IS domain, immunogenicity, Define.xml, VLM, SDTM programming, regulatory submissions, FDA, controlled terminology, P21 validation, BDAGNT, TSTCND, LBLOINC errata, MO decommissioning
  &lt;/div&gt;

&lt;/div&gt;
&lt;/body&gt;
&lt;/html&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2315822260943695633/posts/default/4172477588518113258'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2315822260943695633/posts/default/4172477588518113258'/><link rel='alternate' type='text/html' href='http://studysas.blogspot.com/2026/04/sdtm-ig-34-sr-oe-lb-domain-expansion.html' title='SDTM IG 3.4: SR, OE, LB Domain Expansion, and Revised SUPPDS Assumptions'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-2315822260943695633.post-968227084245683161</id><published>2026-04-07T10:49:00.002-04:00</published><updated>2026-04-24T07:21:11.481-04:00</updated><title type='text'>PMDA SDTM Submission Requirements vs FDA: Where They Actually Diverge</title><content type='html'>&lt;div style=&quot;max-width:860px;margin:0 auto;font-family:Georgia, &#39;Times New Roman&#39;, serif;font-size:17px;line-height:1.8;color:#222;&quot;&gt;

  &lt;h1 style=&quot;font-size:34px;line-height:1.3;margin:0 0 10px 0;color:#1a1a2e;&quot;&gt;
    PMDA SDTM Submission Requirements vs FDA: Where They Actually Diverge
  &lt;/h1&gt;

  &lt;p&gt;
    Global teams often say, &quot;PMDA and FDA both take SDTM, so one package should work for both.&quot;
    That is only half true.
  &lt;/p&gt;

  &lt;p&gt;
    At the model level, the overlap is real. Both agencies expect CDISC-based submissions,
    SAS XPT v5 transport files, define.xml, and reviewer-facing documentation.
    But the real question is not whether your SDTM is CDISC-compliant.
    The real question is this: will the same study package survive both agencies&#39; rule engines,
    review workflows, and local review habits without late rework?
  &lt;/p&gt;

  &lt;p&gt;
    For teams that build for FDA first, the answer is often no, at least not without adjustment.
    The differences that matter most are not random. They cluster around standards catalog timing,
    PMDA pre-submission consultation, Japanese text handling, clinical pharmacology traceability,
    validation behavior, and Japanese date and time conventions.
  &lt;/p&gt;

  &lt;div style=&quot;background:#fff8e1;border-left:4px solid #f0ad00;padding:14px 16px;margin:24px 0;&quot;&gt;
    &lt;strong&gt;Note:&lt;/strong&gt; Before any real submission, always recheck the active PMDA Technical Conformance Guide,
    validation rules version, and accepted standards catalog on the PMDA site. These can change over time.
  &lt;/div&gt;

  &lt;h2 style=&quot;font-size:26px;color:#1a1a2e;margin-top:38px;margin-bottom:10px;border-bottom:1px solid #ddd;padding-bottom:6px;&quot;&gt;
    1. The common ground is real, but it is not the hard part
  &lt;/h2&gt;

  &lt;p&gt;
    Both FDA and PMDA anchor clinical study data expectations in CDISC standards.
    Both expect SDTM and ADaM datasets in SAS XPORT Version 5 format.
    Both expect define.xml.
    Both expect reviewer-facing documentation.
  &lt;/p&gt;

  &lt;p&gt;
    That shared base is real, but it does not remove the operational differences.
    The trouble starts in the last mile, where agencies differ in validation behavior,
    local documentation expectations, accepted standards timing, and handling of Japanese source content.
  &lt;/p&gt;

  &lt;h2 style=&quot;font-size:26px;color:#1a1a2e;margin-top:38px;margin-bottom:10px;border-bottom:1px solid #ddd;padding-bottom:6px;&quot;&gt;
    2. Data standards catalog timing is not handled the same way
  &lt;/h2&gt;

  &lt;p&gt;
    FDA and PMDA both publish accepted standards catalogs, but teams should not assume the timing logic is identical.
    FDA allows older supported standards in ways that are tied to study timing and catalog support windows.
    PMDA is more tightly anchored to what is accepted at the time of submission.
  &lt;/p&gt;

  &lt;p&gt;
    In practice, this means a study that is still acceptable to FDA under an older SDTMIG version may need a closer
    check for PMDA if the PMDA catalog in force at filing no longer lists that version.
    For pooled analyses built from studies that used different versions, PMDA also expects the differences and their
    effect on the integrated package to be explained clearly in the reviewer&#39;s guide.
  &lt;/p&gt;

  &lt;div style=&quot;background:#f7f7f7;border-left:4px solid #2d6cdf;padding:14px 16px;margin:24px 0;&quot;&gt;
    &lt;strong&gt;Programming takeaway:&lt;/strong&gt; Do not lock standards once at study start and assume the question is closed.
    Recheck PMDA standards close to filing.
  &lt;/div&gt;

  &lt;h2 style=&quot;font-size:26px;color:#1a1a2e;margin-top:38px;margin-bottom:10px;border-bottom:1px solid #ddd;padding-bottom:6px;&quot;&gt;
    3. PMDA pre-submission consultation is a real operational gate
  &lt;/h2&gt;

  &lt;p&gt;
    FDA has meeting options, but PMDA&#39;s data-focused consultation process plays a much bigger role in how the package
    is prepared. In practice, sponsors are expected to go through formal PMDA data consultation before filing.
    This is where dataset structure, validation findings, and unresolved submission issues are discussed before the
    NDA reaches the gateway.
  &lt;/p&gt;

  &lt;p&gt;
    Form A and related consultation materials are not just paperwork.
    They force the sponsor to show what datasets will be submitted, what findings remain, and how each issue will be handled.
    That changes behavior upstream. Teams preparing a PMDA package cannot treat this as an FDA package with a regional note added at the end.
  &lt;/p&gt;

  &lt;div style=&quot;background:#fff3f3;border-left:4px solid #cc2222;padding:14px 16px;margin:24px 0;&quot;&gt;
    &lt;strong&gt;Warning:&lt;/strong&gt; PMDA treats serious validation findings much more aggressively in practice.
    Reject-level findings are generally treated as blocking issues and usually must be fixed before review moves forward.
  &lt;/div&gt;

  &lt;h2 style=&quot;font-size:26px;color:#1a1a2e;margin-top:38px;margin-bottom:10px;border-bottom:1px solid #ddd;padding-bottom:6px;&quot;&gt;
    4. Japanese text handling is one of the biggest real differences
  &lt;/h2&gt;

  &lt;p&gt;
    This is where many FDA-first teams get caught.
    PMDA allows Japanese data in submission datasets when translation would lose meaning, but it does not do this casually.
    It uses a paired dataset model.
  &lt;/p&gt;

  &lt;p&gt;
    When Japanese text must be preserved, the sponsor may need:
  &lt;/p&gt;

  &lt;ul style=&quot;margin-top:0;padding-left:24px;&quot;&gt;
    &lt;li&gt;a standard alphanumeric dataset in &lt;code&gt;sdtm&lt;/code&gt; or &lt;code&gt;adam&lt;/code&gt;&lt;/li&gt;
    &lt;li&gt;a Japanese dataset in &lt;code&gt;sdtm_j&lt;/code&gt; or &lt;code&gt;adam_j&lt;/code&gt;&lt;/li&gt;
    &lt;li&gt;the same structure, same variables, same record order, and same record count in both&lt;/li&gt;
    &lt;li&gt;clear explanation in the reviewer&#39;s guide&lt;/li&gt;
  &lt;/ul&gt;

  &lt;p&gt;
    The alphanumeric version may use a placeholder such as
    &lt;code&gt;JAPANESE TEXT IN SOURCE DATABASE&lt;/code&gt; where the source value is carried in the Japanese dataset.
    The encoding used for the Japanese dataset also needs to be documented.
  &lt;/p&gt;

  &lt;p&gt;
    FDA has no parallel dataset folder model like this.
    That is one reason a single global SDTM package is often not really single.
  &lt;/p&gt;

  &lt;pre style=&quot;background:#f5f5f5;border-left:4px solid #2d6cdf;padding:14px 16px;overflow:auto;font-size:14px;line-height:1.6;&quot;&gt;
/* Example: basic check before PMDA XPT export */
proc contents data=sdtm.ae;
run;

/* Review character variables and dataset encoding.
   If Japanese text is present and cannot be translated safely,
   plan paired submission datasets and document the handling. */
  &lt;/pre&gt;

  &lt;h2 style=&quot;font-size:26px;color:#1a1a2e;margin-top:38px;margin-bottom:10px;border-bottom:1px solid #ddd;padding-bottom:6px;&quot;&gt;
    5. Clinical pharmacology submission expectations are tighter
  &lt;/h2&gt;

  &lt;p&gt;
    For clinical pharmacology studies, PMDA expects stronger traceability between concentration data, derived parameters,
    and the analysis path used to produce them.
    In practice, this means the &lt;code&gt;PP&lt;/code&gt; domain is expected alongside &lt;code&gt;PC&lt;/code&gt; for clinical pharmacology work,
    and missing PP content is likely to raise conformance questions.
  &lt;/p&gt;

  &lt;p&gt;
    PMDA also expects clear linkage between PC and PP, typically through RELREC or equivalent documented traceability.
    If RELREC is not used, the reviewer guide should explain how individual parameters can be traced back to the source profile.
  &lt;/p&gt;

  &lt;pre style=&quot;background:#f5f5f5;border-left:4px solid #2d6cdf;padding:14px 16px;overflow:auto;font-size:14px;line-height:1.6;&quot;&gt;
/* Illustrative RELREC concept linking PP back to PC (dataset-level) */
/* Group the source concentrations with PCGRPID and the derived      */
/* parameters with PPGRPID, then relate the two datasets via RELREC. */
/* IDVARVAL stays blank for a dataset-level relationship; RELTYPE    */
/* says how many records one value of IDVAR identifies.              */
/* RDOMAIN   USUBJID   IDVAR     IDVARVAL   RELTYPE   RELID */
/* PC                  PCGRPID              MANY      1     */
/* PP                  PPGRPID              MANY      1     */
  &lt;/pre&gt;

  &lt;p&gt;
    For PK and PK/PD work, PMDA also puts more weight on analysis metadata and reproducibility than many FDA-first teams expect.
    This becomes even more visible when population PK or PBPK materials are involved.
  &lt;/p&gt;

  &lt;h2 style=&quot;font-size:26px;color:#1a1a2e;margin-top:38px;margin-bottom:10px;border-bottom:1px solid #ddd;padding-bottom:6px;&quot;&gt;
    6. PMDA puts more weight on submitted programs and reproducibility
  &lt;/h2&gt;

  &lt;p&gt;
    FDA strongly values SDRG, ADRG, and clean metadata.
    PMDA wants that too, but often goes further on executable traceability.
    The expectation is not just to describe the analysis well.
    The expectation is to make it reproducible enough for review.
  &lt;/p&gt;

  &lt;p&gt;
    For confirmatory studies, PMDA expects the primary analysis program in principle, along with supporting programs
    for ADaM creation and key efficacy, safety, or dose-setting outputs where relevant.
    If the exact program cannot be submitted, the fallback is still a detailed algorithm description, not silence.
  &lt;/p&gt;

  &lt;p&gt;
    This changes how statisticians and programmers should prepare the package.
    A package that feels well documented for FDA can still be weak for PMDA if the derivation chain is hard to reconstruct.
  &lt;/p&gt;

  &lt;h2 style=&quot;font-size:26px;color:#1a1a2e;margin-top:38px;margin-bottom:10px;border-bottom:1px solid #ddd;padding-bottom:6px;&quot;&gt;
    7. MedDRA/J matters, even though the PT code stays the anchor
  &lt;/h2&gt;

  &lt;p&gt;
    MedDRA/J is the Japanese localization of MedDRA.
    The key point for SDTM work is that the code remains stable, while the label surface may be Japanese.
    That means the safest join point across Japanese and English safety content is the code, not the term label.
  &lt;/p&gt;

  &lt;p&gt;
    For PMDA SDTM submission, variables such as &lt;code&gt;AEDECOD&lt;/code&gt;, &lt;code&gt;AESOC&lt;/code&gt;, &lt;code&gt;AEHLT&lt;/code&gt;,
    and related coded fields should still align with standard MedDRA concepts expected by SDTM.
    If the source or coding activity used Japanese labels, that Japanese content belongs in the paired Japanese dataset
    or related documented handling path.
  &lt;/p&gt;

  &lt;p&gt;
    PMDA is also stricter in wording around controlled terminology use.
    It expects accepted standards, terminology, and dictionaries to be used without changing spelling, notation, or capitalization.
  &lt;/p&gt;

  &lt;h2 style=&quot;font-size:26px;color:#1a1a2e;margin-top:38px;margin-bottom:10px;border-bottom:1px solid #ddd;padding-bottom:6px;&quot;&gt;
    8. JADER is not an SDTM target, but it still matters
  &lt;/h2&gt;

  &lt;p&gt;
    JADER is PMDA&#39;s post-marketing adverse drug reaction database.
    It is not a clinical SDTM submission target.
    That point needs to stay clean, because people often blur these two ideas.
  &lt;/p&gt;

  &lt;p&gt;
    JADER still matters for advanced programmers and statisticians because teams sometimes need to align or compare
    JADER safety data with internal clinical trial safety data.
    When that happens, the Japanese labeling, MedDRA/J usage, and local safety conventions become part of the mapping problem.
  &lt;/p&gt;

  &lt;table style=&quot;width:100%;border-collapse:collapse;margin:24px 0;font-size:15px;&quot;&gt;
    &lt;thead&gt;
      &lt;tr&gt;
        &lt;th style=&quot;border:1px solid #ddd;padding:10px;background:#1a1a2e;color:#fff;text-align:left;&quot;&gt;JADER Table&lt;/th&gt;
        &lt;th style=&quot;border:1px solid #ddd;padding:10px;background:#1a1a2e;color:#fff;text-align:left;&quot;&gt;Content&lt;/th&gt;
        &lt;th style=&quot;border:1px solid #ddd;padding:10px;background:#1a1a2e;color:#fff;text-align:left;&quot;&gt;Nearest SDTM Analog&lt;/th&gt;
      &lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody&gt;
      &lt;tr&gt;
        &lt;td style=&quot;border:1px solid #ddd;padding:10px;&quot;&gt;DEMO&lt;/td&gt;
        &lt;td style=&quot;border:1px solid #ddd;padding:10px;&quot;&gt;Patient demographics, outcome, report details&lt;/td&gt;
        &lt;td style=&quot;border:1px solid #ddd;padding:10px;&quot;&gt;DM, DS&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td style=&quot;border:1px solid #ddd;padding:10px;&quot;&gt;DRUG&lt;/td&gt;
        &lt;td style=&quot;border:1px solid #ddd;padding:10px;&quot;&gt;Suspected and concomitant drugs&lt;/td&gt;
        &lt;td style=&quot;border:1px solid #ddd;padding:10px;&quot;&gt;CM, EX&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td style=&quot;border:1px solid #ddd;padding:10px;&quot;&gt;REAC&lt;/td&gt;
        &lt;td style=&quot;border:1px solid #ddd;padding:10px;&quot;&gt;Adverse reactions coded with MedDRA/J&lt;/td&gt;
        &lt;td style=&quot;border:1px solid #ddd;padding:10px;&quot;&gt;AE&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td style=&quot;border:1px solid #ddd;padding:10px;&quot;&gt;HIST&lt;/td&gt;
        &lt;td style=&quot;border:1px solid #ddd;padding:10px;&quot;&gt;History / background disease information&lt;/td&gt;
        &lt;td style=&quot;border:1px solid #ddd;padding:10px;&quot;&gt;MH&lt;/td&gt;
      &lt;/tr&gt;
    &lt;/tbody&gt;
  &lt;/table&gt;

  &lt;p&gt;
    One practical limitation is age granularity.
    JADER often carries age in grouped bins, not exact age.
    That matters if a team tries to compare JADER signals directly with FAERS or trial data without adjusting the analysis plan.
  &lt;/p&gt;

  &lt;h2 style=&quot;font-size:26px;color:#1a1a2e;margin-top:38px;margin-bottom:10px;border-bottom:1px solid #ddd;padding-bottom:6px;&quot;&gt;
    9. Japanese date and time handling can quietly break traceability
  &lt;/h2&gt;

  &lt;p&gt;
    SDTM &lt;code&gt;--DTC&lt;/code&gt; variables must still be ISO 8601 for both FDA and PMDA.
    The issue is not the SDTM output format.
    The issue is the source data.
  &lt;/p&gt;

  &lt;p&gt;
    Japanese source systems may contain era-based dates rather than plain Gregorian dates.
    The current era is &lt;strong&gt;Reiwa&lt;/strong&gt; (令和), which began on May 1, 2019.
    The previous era was &lt;strong&gt;Heisei&lt;/strong&gt; (平成), which ran from 1989 to April 30, 2019.
    Before that was &lt;strong&gt;Showa&lt;/strong&gt; (昭和).
  &lt;/p&gt;

  &lt;p&gt;
    So a source date like &lt;code&gt;令和6年3月15日&lt;/code&gt; needs conversion to &lt;code&gt;2024-03-15&lt;/code&gt;.
    A date like &lt;code&gt;平成31年4月30日&lt;/code&gt; converts to &lt;code&gt;2019-04-30&lt;/code&gt;.
    This is not just formatting.
    It is source interpretation and QC.
  &lt;/p&gt;

  &lt;pre style=&quot;background:#f5f5f5;border-left:4px solid #2d6cdf;padding:14px 16px;overflow:auto;font-size:14px;line-height:1.6;&quot;&gt;
/* Era conversion reference */
/* Showa  : 1926-12-25 to 1989-01-07 */
/* Heisei : 1989-01-08 to 2019-04-30 */
/* Reiwa  : 2019-05-01 onward        */

/* Reiwa Year N  = Gregorian Year (2018 + N)  */
/* Heisei Year N = Gregorian Year (1988 + N)  */

/* Example:
   令和6  -&gt; 2024
   平成31 -&gt; 2019
*/
  &lt;/pre&gt;

  &lt;p&gt;
    The 2019 transition window needs extra QC.
    Any site data around the Heisei-to-Reiwa change should be checked carefully before conversion and SDTM derivation.
  &lt;/p&gt;

  &lt;p&gt;
    Time zone also matters.
    Japan Standard Time is UTC+9 and does not use daylight saving time.
    In mixed-region studies, local source time, normalized analysis time, and SDTM submission time should not be treated as the same thing by default.
  &lt;/p&gt;

  &lt;h2 style=&quot;font-size:26px;color:#1a1a2e;margin-top:38px;margin-bottom:10px;border-bottom:1px solid #ddd;padding-bottom:6px;&quot;&gt;
    10. Validation is not just &quot;run P21 once&quot;
  &lt;/h2&gt;

  &lt;p&gt;
    This is one of the biggest mistakes in dual submissions.
    PMDA uses its own validation rule set within Pinnacle 21 Enterprise.
    FDA-facing validation and PMDA-facing validation are not interchangeable.
  &lt;/p&gt;

  &lt;p&gt;
    The same dataset can behave differently depending on:
  &lt;/p&gt;

  &lt;ul style=&quot;margin-top:0;padding-left:24px;&quot;&gt;
    &lt;li&gt;the rule set in use&lt;/li&gt;
    &lt;li&gt;the engine selected&lt;/li&gt;
    &lt;li&gt;the accepted standards version at filing&lt;/li&gt;
    &lt;li&gt;how each agency handles severity and reviewer explanation&lt;/li&gt;
  &lt;/ul&gt;

  &lt;p&gt;
    So &quot;we ran P21&quot; is not enough.
    The real question is, &quot;Which rules, which engine, and for which authority?&quot;
  &lt;/p&gt;

  &lt;div style=&quot;background:#fff3f3;border-left:4px solid #cc2222;padding:14px 16px;margin:24px 0;&quot;&gt;
    &lt;strong&gt;Warning:&lt;/strong&gt; Do not assume an FDA-clean validation report predicts PMDA acceptance.
    PMDA validation planning should be treated as its own workstream.
  &lt;/div&gt;

  &lt;h2 style=&quot;font-size:26px;color:#1a1a2e;margin-top:38px;margin-bottom:10px;border-bottom:1px solid #ddd;padding-bottom:6px;&quot;&gt;
    11. Controlled terminology drift is handled more tightly on the PMDA side
  &lt;/h2&gt;

  &lt;p&gt;
    Both agencies depend on CDISC controlled terminology, but PMDA&#39;s wording is stricter around exact notation.
    It expects accepted terminology and dictionaries to be used without changing spelling, notation, or capitalization.
  &lt;/p&gt;

  &lt;p&gt;
    This matters in real programming work.
    Old sponsor terms, local lab carryover values, and slightly edited codelist values that may survive longer in FDA-focused workflows
    are more likely to become PMDA review points.
  &lt;/p&gt;

  &lt;p&gt;
    Another practical difference is laboratory terminology.
    FDA reviewers may ask more questions around LOINC alignment.
    PMDA does not place the same visible weight on that point in its SDTM conformance posture.
  &lt;/p&gt;

  &lt;h2 style=&quot;font-size:26px;color:#1a1a2e;margin-top:38px;margin-bottom:10px;border-bottom:1px solid #ddd;padding-bottom:6px;&quot;&gt;
    12. SEND is still a scope difference
  &lt;/h2&gt;

  &lt;p&gt;
    FDA requires SEND for applicable nonclinical submissions.
    PMDA does not currently require SEND in the same way.
    That is a plain scope difference teams should keep separate from the SDTM conversation.
  &lt;/p&gt;

  &lt;h2 style=&quot;font-size:26px;color:#1a1a2e;margin-top:38px;margin-bottom:10px;border-bottom:1px solid #ddd;padding-bottom:6px;&quot;&gt;
    Summary table
  &lt;/h2&gt;

  &lt;table style=&quot;width:100%;border-collapse:collapse;margin:24px 0;font-size:15px;&quot;&gt;
    &lt;thead&gt;
      &lt;tr&gt;
        &lt;th style=&quot;border:1px solid #ddd;padding:10px;background:#1a1a2e;color:#fff;text-align:left;&quot;&gt;Area&lt;/th&gt;
        &lt;th style=&quot;border:1px solid #ddd;padding:10px;background:#1a1a2e;color:#fff;text-align:left;&quot;&gt;FDA&lt;/th&gt;
        &lt;th style=&quot;border:1px solid #ddd;padding:10px;background:#1a1a2e;color:#fff;text-align:left;&quot;&gt;PMDA&lt;/th&gt;
      &lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody&gt;
      &lt;tr&gt;
        &lt;td style=&quot;border:1px solid #ddd;padding:10px;&quot;&gt;Standards timing&lt;/td&gt;
        &lt;td style=&quot;border:1px solid #ddd;padding:10px;&quot;&gt;Tied to supported catalog windows and study timing&lt;/td&gt;
        &lt;td style=&quot;border:1px solid #ddd;padding:10px;&quot;&gt;More tightly tied to standards accepted at submission&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td style=&quot;border:1px solid #ddd;padding:10px;&quot;&gt;Pre-submission data consultation&lt;/td&gt;
        &lt;td style=&quot;border:1px solid #ddd;padding:10px;&quot;&gt;No direct equivalent gate&lt;/td&gt;
        &lt;td style=&quot;border:1px solid #ddd;padding:10px;&quot;&gt;Formal consultation plays a major role before filing&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td style=&quot;border:1px solid #ddd;padding:10px;&quot;&gt;Japanese text handling&lt;/td&gt;
        &lt;td style=&quot;border:1px solid #ddd;padding:10px;&quot;&gt;English-only submission model&lt;/td&gt;
        &lt;td style=&quot;border:1px solid #ddd;padding:10px;&quot;&gt;Paired Japanese and alphanumeric dataset model when needed&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td style=&quot;border:1px solid #ddd;padding:10px;&quot;&gt;Clinical pharmacology traceability&lt;/td&gt;
        &lt;td style=&quot;border:1px solid #ddd;padding:10px;&quot;&gt;Strongly expected, often explained through metadata&lt;/td&gt;
        &lt;td style=&quot;border:1px solid #ddd;padding:10px;&quot;&gt;Tighter expectation around PP, linkage, and reproducibility&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td style=&quot;border:1px solid #ddd;padding:10px;&quot;&gt;Programs and reproducibility&lt;/td&gt;
        &lt;td style=&quot;border:1px solid #ddd;padding:10px;&quot;&gt;Reviewer docs emphasized&lt;/td&gt;
        &lt;td style=&quot;border:1px solid #ddd;padding:10px;&quot;&gt;Programs and algorithm traceability weigh more heavily&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td style=&quot;border:1px solid #ddd;padding:10px;&quot;&gt;Validation behavior&lt;/td&gt;
        &lt;td style=&quot;border:1px solid #ddd;padding:10px;&quot;&gt;Authority-specific review posture&lt;/td&gt;
        &lt;td style=&quot;border:1px solid #ddd;padding:10px;&quot;&gt;Authority-specific rule set with stricter blocking behavior in practice&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td style=&quot;border:1px solid #ddd;padding:10px;&quot;&gt;Japanese era dates&lt;/td&gt;
        &lt;td style=&quot;border:1px solid #ddd;padding:10px;&quot;&gt;Usually not a source issue&lt;/td&gt;
        &lt;td style=&quot;border:1px solid #ddd;padding:10px;&quot;&gt;Can require source interpretation and dedicated QC&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td style=&quot;border:1px solid #ddd;padding:10px;&quot;&gt;MedDRA localization&lt;/td&gt;
        &lt;td style=&quot;border:1px solid #ddd;padding:10px;&quot;&gt;International MedDRA workflow&lt;/td&gt;
        &lt;td style=&quot;border:1px solid #ddd;padding:10px;&quot;&gt;MedDRA/J context with the code as the stable bridge&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td style=&quot;border:1px solid #ddd;padding:10px;&quot;&gt;Post-marketing safety database&lt;/td&gt;
        &lt;td style=&quot;border:1px solid #ddd;padding:10px;&quot;&gt;FAERS&lt;/td&gt;
        &lt;td style=&quot;border:1px solid #ddd;padding:10px;&quot;&gt;JADER&lt;/td&gt;
      &lt;/tr&gt;
    &lt;/tbody&gt;
  &lt;/table&gt;

  &lt;h2 style=&quot;font-size:26px;color:#1a1a2e;margin-top:38px;margin-bottom:10px;border-bottom:1px solid #ddd;padding-bottom:6px;&quot;&gt;
    Final point
  &lt;/h2&gt;

  &lt;p&gt;
    FDA and PMDA both accept SDTM.
    That is the easy part.
  &lt;/p&gt;

  &lt;p&gt;
    The hard part is how each agency expects the package to behave in review.
    PMDA is not just FDA with Japanese labels on top.
    It is a different operating environment, with different pressure points.
  &lt;/p&gt;

  &lt;p&gt;
    Teams that do well with PMDA usually make those decisions early, not a few weeks before filing.
  &lt;/p&gt;

  &lt;div style=&quot;margin-top:32px;font-size:13px;color:#666;font-family:Arial, sans-serif;&quot;&gt;
    Tags: PMDA, FDA, SDTM, Define.xml, MedDRA/J, JADER, Clinical Pharmacology, Pinnacle 21, Regulatory Submissions
  &lt;/div&gt;

&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2315822260943695633/posts/default/968227084245683161'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2315822260943695633/posts/default/968227084245683161'/><link rel='alternate' type='text/html' href='http://studysas.blogspot.com/2026/04/pmda-sdtm-submission-requirements-vs.html' title='PMDA SDTM Submission Requirements vs FDA: Where They Actually Diverge'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-2315822260943695633.post-7149458382487958212</id><published>2026-04-03T21:18:00.002-04:00</published><updated>2026-04-03T21:18:38.790-04:00</updated><title type='text'>Define.xml for SUPPQUAL — Getting QNAM-Level Metadata Right</title><content type='html'>&lt;!DOCTYPE html&gt;
&lt;html lang=&quot;en&quot;&gt;
&lt;head&gt;
&lt;meta charset=&quot;UTF-8&quot;&gt;
&lt;meta name=&quot;viewport&quot; content=&quot;width=device-width, initial-scale=1.0&quot;&gt;
&lt;style&gt;
  *, *::before, *::after { box-sizing: border-box; margin: 0; padding: 0; }

  body {
    font-family: Georgia, &#39;Times New Roman&#39;, serif;
    font-size: 16px;
    line-height: 1.8;
    color: #2c2c2c;
    background: #fafaf8;
    padding: 40px 20px 80px;
  }

  .container {
    max-width: 820px;
    margin: 0 auto;
  }

  .meta-line {
    font-family: &#39;Courier New&#39;, monospace;
    font-size: 12px;
    color: #888;
    letter-spacing: 0.04em;
    text-transform: uppercase;
    margin-bottom: 28px;
    border-bottom: 1px solid #e0e0e0;
    padding-bottom: 12px;
  }

  h1 {
    font-family: Georgia, serif;
    font-size: 2rem;
    color: #1a1a2e;
    line-height: 1.3;
    margin-bottom: 10px;
    font-weight: bold;
  }

  .subtitle {
    font-size: 1.05rem;
    color: #555;
    margin-bottom: 6px;
    font-style: italic;
  }

  .date {
    font-size: 13px;
    color: #999;
    font-family: &#39;Courier New&#39;, monospace;
    margin-bottom: 36px;
  }

  h2 {
    font-family: Georgia, serif;
    font-size: 1.4rem;
    color: #1a1a2e;
    margin-top: 48px;
    margin-bottom: 14px;
    border-left: 4px solid #0066cc;
    padding-left: 12px;
    font-weight: bold;
  }

  h3 {
    font-family: Georgia, serif;
    font-size: 1.15rem;
    color: #1a1a2e;
    margin-top: 32px;
    margin-bottom: 10px;
    font-weight: bold;
  }

  h4 {
    font-size: 1rem;
    color: #333;
    margin-top: 24px;
    margin-bottom: 8px;
    font-weight: bold;
    font-style: italic;
  }

  p {
    margin-bottom: 18px;
  }

  pre {
    background: #f5f5f5;
    border-left: 4px solid #0066cc;
    padding: 16px 18px;
    font-family: &#39;Courier New&#39;, monospace;
    font-size: 13.5px;
    line-height: 1.55;
    overflow-x: auto;
    white-space: pre-wrap;
    word-break: break-word;
    margin: 22px 0;
    color: #1a1a1a;
  }

  code {
    font-family: &#39;Courier New&#39;, monospace;
    font-size: 13.5px;
    background: #f0f0f0;
    color: #b5432a;
    padding: 1px 5px;
    border-radius: 2px;
  }

  pre code {
    background: none;
    color: inherit;
    padding: 0;
    font-size: inherit;
  }

  .note {
    background: #fff8e1;
    border-left: 4px solid #f9a825;
    padding: 14px 18px;
    margin: 22px 0;
    font-size: 15px;
  }

  .warn {
    background: #fff3f3;
    border-left: 4px solid #cc2222;
    padding: 14px 18px;
    margin: 22px 0;
    font-size: 15px;
  }

  table {
    width: 100%;
    border-collapse: collapse;
    margin: 24px 0;
    font-size: 14.5px;
  }

  thead tr {
    background: #1a1a2e;
    color: #fff;
  }

  thead th {
    padding: 10px 14px;
    text-align: left;
    font-weight: normal;
    letter-spacing: 0.02em;
  }

  tbody tr:nth-child(even) {
    background: #f4f4f4;
  }

  tbody td {
    padding: 9px 14px;
    border-bottom: 1px solid #e0e0e0;
    vertical-align: top;
  }

  .tags {
    margin-top: 60px;
    padding-top: 16px;
    border-top: 1px solid #e0e0e0;
    font-family: &#39;Courier New&#39;, monospace;
    font-size: 12px;
    color: #888;
  }

  .tags span {
    display: inline-block;
    background: #e8edf5;
    color: #1a1a2e;
    padding: 3px 8px;
    margin: 3px 4px 3px 0;
    border-radius: 2px;
  }

  .toc {
    background: #f0f4fb;
    border: 1px solid #c5d3e8;
    padding: 20px 24px;
    margin: 28px 0 40px;
    font-size: 14.5px;
  }

  .toc p {
    font-weight: bold;
    color: #1a1a2e;
    margin-bottom: 10px;
    font-size: 15px;
  }

  .toc ol {
    padding-left: 20px;
    margin-bottom: 0;
  }

  .toc li {
    margin-bottom: 4px;
    line-height: 1.6;
  }

  .toc a {
    color: #0066cc;
    text-decoration: none;
  }

  .toc a:hover {
    text-decoration: underline;
  }

  .rejection-box {
    background: #fff3f3;
    border: 1px solid #e0a0a0;
    padding: 18px 20px;
    margin: 20px 0;
  }

  .rejection-box strong {
    color: #cc2222;
    display: block;
    margin-bottom: 6px;
    font-size: 15px;
  }

  .section-divider {
    border: none;
    border-top: 2px solid #e0e0e0;
    margin: 48px 0 0;
  }

  .xml-comment { color: #669966; }
  .xml-tag     { color: #004080; }
  .xml-attr    { color: #7d3c00; }
  .xml-val     { color: #cc2222; }
&lt;/style&gt;
&lt;/head&gt;
&lt;body&gt;
&lt;div class=&quot;container&quot;&gt;

  &lt;p&gt;If you have worked on SDTM submissions long enough, you know SUPPQUAL define.xml is where packages start to break down. Not because the data is wrong, but because the metadata does not fully explain what the data represents.&lt;/p&gt;

  &lt;p&gt;This is not a recap of SUPPQUAL structure. This is about how define.xml actually fails in submission and how to fix it before a reviewer points it out.&lt;/p&gt;

  &lt;p&gt;&lt;strong&gt;SUPPQUAL is not difficult because of structure. It is difficult because the meaning of the data exists only in define.xml.&lt;/strong&gt;&lt;/p&gt;

  &lt;div class=&quot;toc&quot;&gt;
    &lt;p&gt;Table of Contents&lt;/p&gt;
    &lt;ol&gt;
      &lt;li&gt;&lt;a href=&quot;#structure-recap&quot;&gt;The SUPPQUAL ItemGroupDef — What the Spec Actually Requires&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#vlm-concept&quot;&gt;Value-Level Metadata — Why SUPPQUAL Demands It&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#qnam-vlm&quot;&gt;Building QNAM-Level VLM Entries Correctly&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#whereclause&quot;&gt;WhereClauseDef Construction — Mechanics and Traps&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#origin&quot;&gt;Origin Tracing for SUPPQUAL Variables&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#controlled-terms&quot;&gt;Controlled Terminology in SUPPQUAL QVAL — Who Owns the Codelist?&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#rejection-patterns&quot;&gt;Common Submission Rejection Patterns&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#pmda&quot;&gt;PMDA-Specific Considerations&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#sas&quot;&gt;SAS Utility: Generating VLM Entries Programmatically&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#checklist&quot;&gt;Pre-Submission Checklist&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#idvar&quot;&gt;IDVAR / IDVARVAL — The Hidden Failure Point&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#design-decision&quot;&gt;SUPPQUAL vs Custom Domain — Design Decision&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#p21-limits&quot;&gt;What Pinnacle 21 Will NOT Catch&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#scale&quot;&gt;Scaling Problems in Large SUPPQUAL Domains&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#v21&quot;&gt;Define.xml v2.0 vs v2.1 — What Changes for SUPPQUAL&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#edge-cases&quot;&gt;Edge Cases You Will Hit&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#reviewer-model&quot;&gt;How Reviewers Actually Read SUPPQUAL&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#cross-domain&quot;&gt;Cross-Domain Consistency — The Silent Check&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#automation&quot;&gt;Levels of Automation — Maturity Model&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#bad-good&quot;&gt;Bad vs Good — Full Picture&lt;/a&gt;&lt;/li&gt;
    &lt;/ol&gt;
  &lt;/div&gt;

  &lt;!-- ============================================================ --&gt;
  &lt;hr class=&quot;section-divider&quot;&gt;
  &lt;h2 id=&quot;structure-recap&quot;&gt;1. The SUPPQUAL ItemGroupDef — What the Spec Actually Requires&lt;/h2&gt;

  &lt;p&gt;Start with the foundation. A SUPPQUAL dataset in CDISC SDTM is a special-purpose structure with a fixed set of variables: &lt;code&gt;STUDYID&lt;/code&gt;, &lt;code&gt;RDOMAIN&lt;/code&gt;, &lt;code&gt;USUBJID&lt;/code&gt;, &lt;code&gt;IDVAR&lt;/code&gt;, &lt;code&gt;IDVARVAL&lt;/code&gt;, &lt;code&gt;QNAM&lt;/code&gt;, &lt;code&gt;QLABEL&lt;/code&gt;, &lt;code&gt;QVAL&lt;/code&gt;, &lt;code&gt;QORIG&lt;/code&gt;, and &lt;code&gt;QEVAL&lt;/code&gt; — QORIG is Required and QEVAL is Permissible under the SDTM IG, so do not leave them out of the define. Every SUPPQUAL dataset carries these same column names regardless of what domain it hangs off. That fixed structure is what makes define.xml hard — the column names tell you nothing about what any given row contains.&lt;/p&gt;

  &lt;p&gt;In define.xml (Define-XML v2.0 and v2.1), a SUPPQUAL domain is represented as an &lt;code&gt;ItemGroupDef&lt;/code&gt; whose &lt;code&gt;OID&lt;/code&gt; typically follows &lt;code&gt;IG.SUPPXX&lt;/code&gt; convention. Inside that ItemGroupDef, you declare ItemRefs for the structural columns. That part is routine. The complexity begins with &lt;code&gt;QNAM&lt;/code&gt; and &lt;code&gt;QVAL&lt;/code&gt;.&lt;/p&gt;

  &lt;p&gt;The FDA Technical Conformance Guide (TCG), the CDISC Define-XML 2.0 specification, and the CDISC SDTM IG all converge on the same expectation: every distinct QNAM value that appears in the dataset must have a corresponding Value-Level Metadata entry in define.xml. Not a collective entry. Not a reference to the QNAM column generally. Each QNAM individually, with its own label, data type, origin, and — where applicable — codelist or controlled terminology reference.&lt;/p&gt;

  &lt;div class=&quot;warn&quot;&gt;
    &lt;strong&gt;This is the single most common gap in SUPPQUAL define packages.&lt;/strong&gt; Programmers correctly document the column-level metadata for QNAM (type=text, origin=Predecessor, etc.) but never create the value-level entries that tell reviewers what QNAM=&quot;AESLIFE&quot; or QNAM=&quot;LBMETHOD&quot; actually means in context. FDA reviewers are specifically checking for VLM completeness in SUPP-- domains.
  &lt;/div&gt;

  &lt;p&gt;Here is the minimal correct ItemGroupDef structure for a SUPPAE domain:&lt;/p&gt;

&lt;pre&gt;&amp;lt;!-- ItemGroupDef for SUPPAE --&amp;gt;
&amp;lt;def:ItemGroupDef OID=&quot;IG.SUPPAE&quot;
                  Name=&quot;SUPPAE&quot;
                  Repeating=&quot;Yes&quot;
                  IsReferenceData=&quot;No&quot;
                  SASDatasetName=&quot;SUPPAE&quot;
                  def:Structure=&quot;Supplemental Qualifiers for AE&quot;
                  def:Purpose=&quot;Tabulation&quot;
                  def:StandardOID=&quot;STD.SDTMIG.3.3&quot;
                  def:ArchiveLocationID=&quot;LF.SUPPAE&quot;&amp;gt;
  &amp;lt;Description&amp;gt;
    &amp;lt;TranslatedText xml:lang=&quot;en&quot;&amp;gt;
      Supplemental Qualifiers for Adverse Events
    &amp;lt;/TranslatedText&amp;gt;
  &amp;lt;/Description&amp;gt;
  &amp;lt;ItemRef ItemOID=&quot;IT.SUPPAE.STUDYID&quot;  OrderNumber=&quot;1&quot;  Mandatory=&quot;Yes&quot; KeySequence=&quot;1&quot;/&amp;gt;
  &amp;lt;ItemRef ItemOID=&quot;IT.SUPPAE.RDOMAIN&quot;  OrderNumber=&quot;2&quot;  Mandatory=&quot;Yes&quot; KeySequence=&quot;2&quot;/&amp;gt;
  &amp;lt;ItemRef ItemOID=&quot;IT.SUPPAE.USUBJID&quot;  OrderNumber=&quot;3&quot;  Mandatory=&quot;Yes&quot; KeySequence=&quot;3&quot;/&amp;gt;
  &amp;lt;ItemRef ItemOID=&quot;IT.SUPPAE.IDVAR&quot;    OrderNumber=&quot;4&quot;  Mandatory=&quot;Yes&quot; KeySequence=&quot;4&quot;/&amp;gt;
  &amp;lt;ItemRef ItemOID=&quot;IT.SUPPAE.IDVARVAL&quot; OrderNumber=&quot;5&quot;  Mandatory=&quot;Yes&quot; KeySequence=&quot;5&quot;/&amp;gt;
  &amp;lt;ItemRef ItemOID=&quot;IT.SUPPAE.QNAM&quot;     OrderNumber=&quot;6&quot;  Mandatory=&quot;Yes&quot; KeySequence=&quot;6&quot;/&amp;gt;
  &amp;lt;ItemRef ItemOID=&quot;IT.SUPPAE.QLABEL&quot;   OrderNumber=&quot;7&quot;  Mandatory=&quot;Yes&quot;/&amp;gt;
  &amp;lt;ItemRef ItemOID=&quot;IT.SUPPAE.QVAL&quot;     OrderNumber=&quot;8&quot;  Mandatory=&quot;Yes&quot;/&amp;gt;
  &amp;lt;ItemRef ItemOID=&quot;IT.SUPPAE.QORIG&quot;    OrderNumber=&quot;9&quot;  Mandatory=&quot;Yes&quot;/&amp;gt;
  &amp;lt;ItemRef ItemOID=&quot;IT.SUPPAE.QEVAL&quot;    OrderNumber=&quot;10&quot; Mandatory=&quot;No&quot;/&amp;gt;
  &amp;lt;def:leaf ID=&quot;LF.SUPPAE&quot; xlink:href=&quot;suppae.xpt&quot;&amp;gt;
    &amp;lt;def:title&amp;gt;suppae.xpt&amp;lt;/def:title&amp;gt;
  &amp;lt;/def:leaf&amp;gt;
&amp;lt;/def:ItemGroupDef&amp;gt;&lt;/pre&gt;

  &lt;p&gt;Note the &lt;code&gt;KeySequence&lt;/code&gt; attributes on the first six ItemRefs. KeySequence is a standard ODM attribute on ItemRef — it does not take the &lt;code&gt;def:&lt;/code&gt; namespace prefix. The natural key of a SUPP-- dataset is STUDYID, RDOMAIN, USUBJID, IDVAR, IDVARVAL, QNAM, and declaring the full sequence signals to the define viewer that QNAM functions as the final discriminator within this row structure. Some older define packages omit the key sequence entirely. Reviewers notice.&lt;/p&gt;

  &lt;!-- ============================================================ --&gt;
  &lt;hr class=&quot;section-divider&quot;&gt;
  &lt;h2 id=&quot;vlm-concept&quot;&gt;2. Value-Level Metadata — Why SUPPQUAL Demands It&lt;/h2&gt;

  &lt;p&gt;Value-Level Metadata (VLM) in define.xml exists to document variables whose meaning is row-dependent. In most SDTM datasets, a column has one meaning. &lt;code&gt;AETERM&lt;/code&gt; always means adverse event term. You document it once at the column level and you are done.&lt;/p&gt;

  &lt;p&gt;SUPPQUAL breaks this. &lt;code&gt;QVAL&lt;/code&gt; can contain a free-text description in one row, a numeric value stored as text in another (QVAL itself is always a character variable in the XPT), a controlled term from a codelist in a third. The column-level metadata for QVAL — because it must accommodate everything — is necessarily generic. It cannot tell the reviewer whether &lt;code&gt;QVAL&lt;/code&gt; for &lt;code&gt;QNAM=&quot;AESLIFE&quot;&lt;/code&gt; should be Y/N, whether &lt;code&gt;QVAL&lt;/code&gt; for &lt;code&gt;QNAM=&quot;LBMETHOD&quot;&lt;/code&gt; maps to a codelist, or whether &lt;code&gt;QVAL&lt;/code&gt; for &lt;code&gt;QNAM=&quot;AOCCIFL&quot;&lt;/code&gt; is a character flag with a specific set of permissible values.&lt;/p&gt;

  &lt;p&gt;VLM fixes this. It is the mechanism by which you attach row-specific metadata to a column. In define.xml v2.0/v2.1, VLM is implemented using &lt;code&gt;def:ValueListDef&lt;/code&gt; elements referenced from a &lt;code&gt;def:ValueListRef&lt;/code&gt; attribute on the QVAL ItemDef. Each entry inside the ValueListDef is a separate &lt;code&gt;ItemRef&lt;/code&gt;, constrained by a &lt;code&gt;WhereClauseRef&lt;/code&gt; that scopes it to a specific QNAM value.&lt;/p&gt;

  &lt;div class=&quot;note&quot;&gt;
    &lt;strong&gt;SUPPQUAL Interpretation Flow:&lt;/strong&gt;&lt;br&gt;
    QNAM &amp;nbsp;→&amp;nbsp; WhereClause &amp;nbsp;→&amp;nbsp; VLM ItemDef &amp;nbsp;→&amp;nbsp; Origin / Codelist &amp;nbsp;→&amp;nbsp; Reviewer Understanding&lt;br&gt;&lt;br&gt;
    Every link in this chain must be explicit and correct. A break at any point means the reviewer cannot interpret the variable — and will write a query instead.
  &lt;/div&gt;

  &lt;p&gt;Think of it as a lookup table stitched into the define.xml structure itself. The reviewer opens the define viewer, clicks on QVAL in SUPPAE, and instead of seeing a single generic description, they see a structured list of every QNAM with its own label, type, origin, and optionally a codelist link.&lt;/p&gt;

  &lt;!-- ============================================================ --&gt;
  &lt;hr class=&quot;section-divider&quot;&gt;
  &lt;h2 id=&quot;qnam-vlm&quot;&gt;3. Building QNAM-Level VLM Entries Correctly&lt;/h2&gt;

  &lt;p&gt;The complete VLM implementation for SUPPQUAL requires four interconnected XML components working together. Get any one wrong and the define viewer renders garbage or the validator throws errors.&lt;/p&gt;

  &lt;h3&gt;3.1 The ValueListDef Block&lt;/h3&gt;

  &lt;p&gt;You declare one &lt;code&gt;def:ValueListDef&lt;/code&gt; per SUPPQUAL dataset. Its OID is referenced from the QVAL ItemDef. Inside it, one ItemRef per distinct QNAM value in your dataset.&lt;/p&gt;

&lt;pre&gt;&amp;lt;def:ValueListDef OID=&quot;VL.SUPPAE.QVAL&quot;&amp;gt;

  &amp;lt;!-- Entry for QNAM = AESLIFE --&amp;gt;
  &amp;lt;ItemRef ItemOID=&quot;IT.SUPPAE.QVAL.AESLIFE&quot;
           OrderNumber=&quot;1&quot;
           Mandatory=&quot;Yes&quot;&amp;gt;
    &amp;lt;def:WhereClauseRef WhereClauseOID=&quot;WC.SUPPAE.QNAM.AESLIFE&quot;/&amp;gt;
  &amp;lt;/ItemRef&amp;gt;

  &amp;lt;!-- Entry for QNAM = AECONTRT --&amp;gt;
  &amp;lt;ItemRef ItemOID=&quot;IT.SUPPAE.QVAL.AECONTRT&quot;
           OrderNumber=&quot;2&quot;
           Mandatory=&quot;Yes&quot;&amp;gt;
    &amp;lt;def:WhereClauseRef WhereClauseOID=&quot;WC.SUPPAE.QNAM.AECONTRT&quot;/&amp;gt;
  &amp;lt;/ItemRef&amp;gt;

  &amp;lt;!-- Entry for QNAM = AOCCIFL --&amp;gt;
  &amp;lt;ItemRef ItemOID=&quot;IT.SUPPAE.QVAL.AOCCIFL&quot;
           OrderNumber=&quot;3&quot;
           Mandatory=&quot;No&quot;&amp;gt;
    &amp;lt;def:WhereClauseRef WhereClauseOID=&quot;WC.SUPPAE.QNAM.AOCCIFL&quot;/&amp;gt;
  &amp;lt;/ItemRef&amp;gt;

  &amp;lt;!-- Entry for QNAM = AERELNST --&amp;gt;
  &amp;lt;ItemRef ItemOID=&quot;IT.SUPPAE.QVAL.AERELNST&quot;
           OrderNumber=&quot;4&quot;
           Mandatory=&quot;No&quot;&amp;gt;
    &amp;lt;def:WhereClauseRef WhereClauseOID=&quot;WC.SUPPAE.QNAM.AERELNST&quot;/&amp;gt;
  &amp;lt;/ItemRef&amp;gt;

&amp;lt;/def:ValueListDef&amp;gt;&lt;/pre&gt;

  &lt;p&gt;The &lt;code&gt;Mandatory&lt;/code&gt; attribute here reflects whether every subject/record that has a parent AE record must have this QNAM populated. This is a clinical judgment, not a programming one. Get input from your data manager.&lt;/p&gt;

  &lt;h3&gt;3.2 The QVAL ItemDef with ValueListRef&lt;/h3&gt;

  &lt;p&gt;The QVAL ItemDef at column level must carry a &lt;code&gt;def:ValueListRef&lt;/code&gt; pointing to your ValueListDef. This is the hook that connects column metadata to value-level metadata.&lt;/p&gt;

&lt;pre&gt;&amp;lt;ItemDef OID=&quot;IT.SUPPAE.QVAL&quot;
         Name=&quot;QVAL&quot;
         DataType=&quot;text&quot;
         Length=&quot;200&quot;
         SASFieldName=&quot;QVAL&quot;&amp;gt;
  &amp;lt;Description&amp;gt;
    &amp;lt;TranslatedText xml:lang=&quot;en&quot;&amp;gt;
      Result Value for the Supplemental Qualifier
    &amp;lt;/TranslatedText&amp;gt;
  &amp;lt;/Description&amp;gt;
  &amp;lt;!-- This is the critical link to VLM --&amp;gt;
  &amp;lt;def:ValueListRef ValueListOID=&quot;VL.SUPPAE.QVAL&quot;/&amp;gt;
&amp;lt;/ItemDef&amp;gt;&lt;/pre&gt;

  &lt;div class=&quot;note&quot;&gt;
    The &lt;code&gt;Length&lt;/code&gt; on the column-level QVAL ItemDef should match the actual XPT variable length. The VLM-level ItemDefs for each QNAM can specify shorter lengths that reflect the actual maximum length for that specific qualifier. FDA reviewers check for length consistency.
  &lt;/div&gt;

  &lt;h3&gt;3.3 The VLM-Level ItemDefs&lt;/h3&gt;

  &lt;p&gt;Each QNAM gets its own ItemDef. This is where the clinical meaning, data type, codelist reference, and origin go. This is the piece that most define packages either skip entirely or populate with placeholder text.&lt;/p&gt;

&lt;pre&gt;&amp;lt;!-- AESLIFE: Life Threatening --&amp;gt;
&amp;lt;ItemDef OID=&quot;IT.SUPPAE.QVAL.AESLIFE&quot;
         Name=&quot;AESLIFE&quot;
         DataType=&quot;text&quot;
         Length=&quot;1&quot;
         SASFieldName=&quot;QVAL&quot;&amp;gt;
  &amp;lt;Description&amp;gt;
    &amp;lt;TranslatedText xml:lang=&quot;en&quot;&amp;gt;
      Indicator of whether the adverse event was life-threatening.
      Populated from the Life-Threatening field on the SAE page.
    &amp;lt;/TranslatedText&amp;gt;
  &amp;lt;/Description&amp;gt;
  &amp;lt;CodeListRef CodeListOID=&quot;CL.NY&quot;/&amp;gt;
  &amp;lt;def:Origin Type=&quot;CRF&quot;&amp;gt;
    &amp;lt;def:DocumentRef leafID=&quot;LF.CRF&quot;&amp;gt;
      &amp;lt;def:PDFPageRef Type=&quot;NamedDestination&quot;
                      PageRefs=&quot;AE_SAE_PAGE&quot;/&amp;gt;
    &amp;lt;/def:DocumentRef&amp;gt;
  &amp;lt;/def:Origin&amp;gt;
&amp;lt;/ItemDef&amp;gt;

&amp;lt;!-- AECONTRT: Concomitant Treatment Given --&amp;gt;
&amp;lt;ItemDef OID=&quot;IT.SUPPAE.QVAL.AECONTRT&quot;
         Name=&quot;AECONTRT&quot;
         DataType=&quot;text&quot;
         Length=&quot;1&quot;
         SASFieldName=&quot;QVAL&quot;&amp;gt;
  &amp;lt;Description&amp;gt;
    &amp;lt;TranslatedText xml:lang=&quot;en&quot;&amp;gt;
      Indicator of whether concomitant treatment was given for the AE.
    &amp;lt;/TranslatedText&amp;gt;
  &amp;lt;/Description&amp;gt;
  &amp;lt;CodeListRef CodeListOID=&quot;CL.NY&quot;/&amp;gt;
  &amp;lt;def:Origin Type=&quot;CRF&quot;&amp;gt;
    &amp;lt;def:DocumentRef leafID=&quot;LF.CRF&quot;&amp;gt;
      &amp;lt;def:PDFPageRef Type=&quot;NamedDestination&quot;
                      PageRefs=&quot;AE_DETAILS_PAGE&quot;/&amp;gt;
    &amp;lt;/def:DocumentRef&amp;gt;
  &amp;lt;/def:Origin&amp;gt;
&amp;lt;/ItemDef&amp;gt;

&amp;lt;!-- AOCCIFL: First Occurrence Within Subject Flag --&amp;gt;
&amp;lt;ItemDef OID=&quot;IT.SUPPAE.QVAL.AOCCIFL&quot;
         Name=&quot;AOCCIFL&quot;
         DataType=&quot;text&quot;
         Length=&quot;1&quot;
         SASFieldName=&quot;QVAL&quot;&amp;gt;
  &amp;lt;Description&amp;gt;
    &amp;lt;TranslatedText xml:lang=&quot;en&quot;&amp;gt;
      Flag indicating the first occurrence of an AE with the same
      preferred term. Derived based on chronological order within subject.
    &amp;lt;/TranslatedText&amp;gt;
  &amp;lt;/Description&amp;gt;
  &amp;lt;CodeListRef CodeListOID=&quot;CL.NY&quot;/&amp;gt;
  &amp;lt;def:Origin Type=&quot;Derived&quot;&amp;gt;
    &amp;lt;def:DocumentRef leafID=&quot;LF.SUPPAE_SPECS&quot;&amp;gt;
      &amp;lt;def:PDFPageRef Type=&quot;NamedDestination&quot;
                      PageRefs=&quot;SUPPAE_DERIVATION&quot;/&amp;gt;
    &amp;lt;/def:DocumentRef&amp;gt;
  &amp;lt;/def:Origin&amp;gt;
&amp;lt;/ItemDef&amp;gt;

&amp;lt;!-- AERELNST: Relationship to Non-Study Treatment --&amp;gt;
&amp;lt;ItemDef OID=&quot;IT.SUPPAE.QVAL.AERELNST&quot;
         Name=&quot;AERELNST&quot;
         DataType=&quot;text&quot;
         Length=&quot;50&quot;
         SASFieldName=&quot;QVAL&quot;&amp;gt;
  &amp;lt;Description&amp;gt;
    &amp;lt;TranslatedText xml:lang=&quot;en&quot;&amp;gt;
      Relationship of the adverse event to a non-study treatment.
      Free text captured on the AE CRF.
    &amp;lt;/TranslatedText&amp;gt;
  &amp;lt;/Description&amp;gt;
  &amp;lt;!-- No CodeListRef — free text field --&amp;gt;
  &amp;lt;def:Origin Type=&quot;CRF&quot;&amp;gt;
    &amp;lt;def:DocumentRef leafID=&quot;LF.CRF&quot;&amp;gt;
      &amp;lt;def:PDFPageRef Type=&quot;NamedDestination&quot;
                      PageRefs=&quot;AE_RELATIONSHIP_PAGE&quot;/&amp;gt;
    &amp;lt;/def:DocumentRef&amp;gt;
  &amp;lt;/def:Origin&amp;gt;
&amp;lt;/ItemDef&amp;gt;&lt;/pre&gt;

  &lt;p&gt;Several things to notice here. First, &lt;code&gt;SASFieldName=&quot;QVAL&quot;&lt;/code&gt; on every VLM-level ItemDef. This is correct — the actual XPT variable being described is QVAL regardless of which QNAM you are documenting. Second, the &lt;code&gt;Length&lt;/code&gt; at VLM level reflects the actual maximum length for that qualifier&#39;s values, not the dataset-level QVAL length. Third, the &lt;code&gt;CodeListRef&lt;/code&gt; is present only when the QNAM has controlled values. Free-text QNAMs get no CodeListRef. This distinction matters — a reviewer who sees a codelist reference for a free-text field will flag it.&lt;/p&gt;

  &lt;!-- ============================================================ --&gt;
  &lt;hr class=&quot;section-divider&quot;&gt;
  &lt;h2 id=&quot;whereclause&quot;&gt;4. WhereClauseDef Construction — Mechanics and Traps&lt;/h2&gt;

  &lt;p&gt;The &lt;code&gt;def:WhereClauseDef&lt;/code&gt; is what scopes each VLM entry to its QNAM. It defines the condition &quot;this metadata applies when QNAM equals this value.&quot; Getting WhereClauseDef wrong is the second most common source of define validation errors in SUPPQUAL packages.&lt;/p&gt;

  &lt;h3&gt;4.1 Standard WhereClauseDef Structure&lt;/h3&gt;

&lt;pre&gt;&amp;lt;!-- WhereClause for QNAM = AESLIFE --&amp;gt;
&amp;lt;def:WhereClauseDef OID=&quot;WC.SUPPAE.QNAM.AESLIFE&quot;&amp;gt;
  &amp;lt;RangeCheck Comparator=&quot;EQ&quot;
              SoftHard=&quot;Soft&quot;
              def:ItemOID=&quot;IT.SUPPAE.QNAM&quot;&amp;gt;
    &amp;lt;CheckValue&amp;gt;AESLIFE&amp;lt;/CheckValue&amp;gt;
  &amp;lt;/RangeCheck&amp;gt;
&amp;lt;/def:WhereClauseDef&amp;gt;

&amp;lt;!-- WhereClause for QNAM = AECONTRT --&amp;gt;
&amp;lt;def:WhereClauseDef OID=&quot;WC.SUPPAE.QNAM.AECONTRT&quot;&amp;gt;
  &amp;lt;RangeCheck Comparator=&quot;EQ&quot;
              SoftHard=&quot;Soft&quot;
              def:ItemOID=&quot;IT.SUPPAE.QNAM&quot;&amp;gt;
    &amp;lt;CheckValue&amp;gt;AECONTRT&amp;lt;/CheckValue&amp;gt;
  &amp;lt;/RangeCheck&amp;gt;
&amp;lt;/def:WhereClauseDef&amp;gt;

&amp;lt;!-- WhereClause for QNAM = AOCCIFL --&amp;gt;
&amp;lt;def:WhereClauseDef OID=&quot;WC.SUPPAE.QNAM.AOCCIFL&quot;&amp;gt;
  &amp;lt;RangeCheck Comparator=&quot;EQ&quot;
              SoftHard=&quot;Soft&quot;
              def:ItemOID=&quot;IT.SUPPAE.QNAM&quot;&amp;gt;
    &amp;lt;CheckValue&amp;gt;AOCCIFL&amp;lt;/CheckValue&amp;gt;
  &amp;lt;/RangeCheck&amp;gt;
&amp;lt;/def:WhereClauseDef&amp;gt;&lt;/pre&gt;

  &lt;p&gt;The &lt;code&gt;def:ItemOID&lt;/code&gt; attribute on the RangeCheck must point to the ItemDef for QNAM within the same SUPPQUAL dataset — specifically &lt;code&gt;IT.SUPPAE.QNAM&lt;/code&gt; in this example. Not a generic QNAM OID. Not a cross-domain reference. The OID must resolve to the QNAM column definition within this specific SUPPQUAL domain.&lt;/p&gt;

  &lt;div class=&quot;warn&quot;&gt;
    &lt;strong&gt;Common mistake: reusing WhereClauseDef OIDs across SUPPQUAL domains.&lt;/strong&gt; If you build SUPPAE and SUPPLB and give both the same WC OIDs for shared QNAM names (like QNAM=FAST or QNAM=SPEC), validators will throw duplicate OID errors or silently cross-link the wrong ItemOID references. Every domain needs its own WhereClauseDef set with domain-scoped OIDs and domain-specific ItemOID references.
  &lt;/div&gt;

  &lt;h3&gt;4.2 The SoftHard Attribute&lt;/h3&gt;

  &lt;p&gt;Use &lt;code&gt;SoftHard=&quot;Soft&quot;&lt;/code&gt; for SUPPQUAL WhereClause entries. A &lt;code&gt;Hard&lt;/code&gt; constraint implies the data should fail a range check if the condition is violated. In a SUPP context the WhereClause is not a validation rule — it is a scoping filter. Soft is correct. Some define generators default to Hard. Check your output.&lt;/p&gt;

  &lt;h3&gt;4.3 Case Sensitivity in CheckValue&lt;/h3&gt;

  &lt;p&gt;The value inside &lt;code&gt;&amp;lt;CheckValue&amp;gt;&lt;/code&gt; must match the actual QNAM values in the XPT exactly, including case. SAS XPT is case-preserving for character values. If your dataset has QNAM=&quot;AESlife&quot; in even one row and your WhereClauseDef has &lt;code&gt;&amp;lt;CheckValue&amp;gt;AESLIFE&amp;lt;/CheckValue&amp;gt;&lt;/code&gt;, the VLM entry will not resolve for those rows and Pinnacle 21 will flag the mismatch. Validate against the actual unique QNAM values in your dataset before finalizing define.xml.&lt;/p&gt;

  &lt;h3&gt;4.4 Multi-Condition WhereClause (Rare but Real)&lt;/h3&gt;

  &lt;p&gt;Occasionally a qualifier&#39;s meaning changes depending on the parent IDVAR. If QNAM=&quot;VISIT&quot; behaves differently when IDVAR=&quot;AESEQ&quot; versus IDVAR=&quot;MHSEQ&quot; — which should not happen in well-designed SDTM but does happen in rescue mapping situations — you can build a multi-condition WhereClause:&lt;/p&gt;

&lt;pre&gt;&amp;lt;def:WhereClauseDef OID=&quot;WC.SUPPAE.QNAM.VISIT.AESEQ&quot;&amp;gt;
  &amp;lt;RangeCheck Comparator=&quot;EQ&quot;
              SoftHard=&quot;Soft&quot;
              def:ItemOID=&quot;IT.SUPPAE.QNAM&quot;&amp;gt;
    &amp;lt;CheckValue&amp;gt;VISIT&amp;lt;/CheckValue&amp;gt;
  &amp;lt;/RangeCheck&amp;gt;
  &amp;lt;RangeCheck Comparator=&quot;EQ&quot;
              SoftHard=&quot;Soft&quot;
              def:ItemOID=&quot;IT.SUPPAE.IDVAR&quot;&amp;gt;
    &amp;lt;CheckValue&amp;gt;AESEQ&amp;lt;/CheckValue&amp;gt;
  &amp;lt;/RangeCheck&amp;gt;
&amp;lt;/def:WhereClauseDef&amp;gt;&lt;/pre&gt;

  &lt;p&gt;Multiple RangeCheck elements inside one WhereClauseDef are evaluated as AND conditions by the Define-XML specification. Use this sparingly. If you find yourself doing this frequently, it is usually a sign that the SUPPQUAL design itself needs revisiting before worrying about the define.&lt;/p&gt;

  &lt;!-- ============================================================ --&gt;
  &lt;hr class=&quot;section-divider&quot;&gt;
  &lt;h2 id=&quot;origin&quot;&gt;5. Origin Tracing for SUPPQUAL Variables&lt;/h2&gt;

  &lt;p&gt;Origin documentation for SUPPQUAL is where the real intellectual work lives. It is also where most packages cut corners. Regulatory reviewers — particularly FDA and PMDA — are increasingly using define.xml as an audit instrument, not just a reference document. The origin chain must be defensible.&lt;/p&gt;

  &lt;h3&gt;5.1 The Define-XML Origin Types and What They Mean in SUPPQUAL Context&lt;/h3&gt;

  &lt;table&gt;
    &lt;thead&gt;
      &lt;tr&gt;
        &lt;th&gt;Origin Type&lt;/th&gt;
        &lt;th&gt;When to Use in SUPPQUAL&lt;/th&gt;
        &lt;th&gt;What Reviewers Expect&lt;/th&gt;
      &lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody&gt;
      &lt;tr&gt;
        &lt;td&gt;&lt;code&gt;CRF&lt;/code&gt;&lt;/td&gt;
        &lt;td&gt;QVAL is directly transcribed from a CRF field&lt;/td&gt;
        &lt;td&gt;PDF page reference pointing to the exact CRF question. NamedDestination preferred over page numbers.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;&lt;code&gt;Derived&lt;/code&gt;&lt;/td&gt;
        &lt;td&gt;QVAL is computed from other data (flags, first-occurrence logic, duration calculations)&lt;/td&gt;
        &lt;td&gt;Reference to derivation specs or annotated CRF note explaining the logic. Reviewers want to see the method, not just &quot;Derived.&quot;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;&lt;code&gt;Assigned&lt;/code&gt;&lt;/td&gt;
        &lt;td&gt;QVAL comes from a sponsor-assigned value not captured on a CRF (study day calculations, batch assignments)&lt;/td&gt;
        &lt;td&gt;Some reference to the assigning entity or protocol specification.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;&lt;code&gt;Predecessor&lt;/code&gt;&lt;/td&gt;
        &lt;td&gt;QVAL is carried over or transformed from a prior dataset or CDASH mapping — use with caution in SUPPQUAL&lt;/td&gt;
        &lt;td&gt;The predecessor source should be traceable. Generic &quot;Predecessor&quot; with no document reference is not acceptable in modern submissions.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;&lt;code&gt;Protocol&lt;/code&gt;&lt;/td&gt;
        &lt;td&gt;QVAL represents a protocol-defined classification not captured explicitly in the CRF&lt;/td&gt;
        &lt;td&gt;Reference to specific protocol section or amendment.&lt;/td&gt;
      &lt;/tr&gt;
    &lt;/tbody&gt;
  &lt;/table&gt;

  &lt;h3&gt;5.2 CRF Origin — Getting the PDFPageRef Right&lt;/h3&gt;

  &lt;p&gt;The most common origin in SUPPQUAL is CRF. The structure requires a DocumentRef pointing to the annotated CRF leaf, with a PDFPageRef specifying where in the CRF the field appears.&lt;/p&gt;

&lt;pre&gt;&amp;lt;def:Origin Type=&quot;CRF&quot;&amp;gt;
  &amp;lt;def:DocumentRef leafID=&quot;LF.ACRF&quot;&amp;gt;
    &amp;lt;def:PDFPageRef
      Type=&quot;NamedDestination&quot;
      PageRefs=&quot;AE_PAGE_SAE_CRITERIA&quot;/&amp;gt;
  &amp;lt;/def:DocumentRef&amp;gt;
&amp;lt;/def:Origin&amp;gt;&lt;/pre&gt;

  &lt;p&gt;The &lt;code&gt;Type&lt;/code&gt; on PDFPageRef should be &lt;code&gt;NamedDestination&lt;/code&gt; if your annotated CRF has named bookmark anchors, or &lt;code&gt;PhysicalRef&lt;/code&gt; if you are using physical page numbers. Named destinations are more stable across CRF revisions. If your CRF authoring tool supports it, insist on named destinations. Physical page numbers shift whenever a CRF page is added or removed, and a define.xml built against page numbers becomes inaccurate with each CRF version increment.&lt;/p&gt;

  &lt;p&gt;The &lt;code&gt;leafID&lt;/code&gt; must match the &lt;code&gt;ID&lt;/code&gt; attribute of a &lt;code&gt;def:leaf&lt;/code&gt; element declared elsewhere in your define.xml. That leaf must point to a document that actually exists in the submission package. Broken leaf references fail define validation. Cross-check the leaf IDs against your actual eSub folder structure before finalizing.&lt;/p&gt;

  &lt;h3&gt;5.3 Derived Origin — Giving Reviewers Enough Information&lt;/h3&gt;

  &lt;p&gt;Derived QNAMs are the hardest to document well. The spec says to include a DocumentRef, but many programmers point to a general specifications document rather than the specific derivation. This is a missed opportunity.&lt;/p&gt;

  &lt;p&gt;The minimum acceptable Derived origin documentation in 2024+ submissions includes: a reference document that describes the derivation logic, a page or named anchor within that document that shows the specific algorithm, and — where the derivation is non-trivial — a Description element on the ItemDef that explains the method in enough plain language that a reviewer can understand it without opening the spec document.&lt;/p&gt;

&lt;pre&gt;&amp;lt;ItemDef OID=&quot;IT.SUPPAE.QVAL.AOCCIFL&quot;
         Name=&quot;AOCCIFL&quot;
         DataType=&quot;text&quot;
         Length=&quot;1&quot;
         SASFieldName=&quot;QVAL&quot;&amp;gt;
  &amp;lt;Description&amp;gt;
    &amp;lt;TranslatedText xml:lang=&quot;en&quot;&amp;gt;
      Any occurrence indicator flag. Set to &#39;Y&#39; for the first chronological
      occurrence of an AE preferred term within a subject, based on AESTDTC
      ascending, then AESEQ ascending. All subsequent occurrences of the same
      preferred term for the same subject are left blank.
    &amp;lt;/TranslatedText&amp;gt;
  &amp;lt;/Description&amp;gt;
  &amp;lt;def:Origin Type=&quot;Derived&quot;&amp;gt;
    &amp;lt;def:DocumentRef leafID=&quot;LF.SDTM_SPECS&quot;&amp;gt;
      &amp;lt;def:PDFPageRef Type=&quot;NamedDestination&quot;
                      PageRefs=&quot;SUPPAE_AOCCIFL_DERIVATION&quot;/&amp;gt;
    &amp;lt;/def:DocumentRef&amp;gt;
  &amp;lt;/def:Origin&amp;gt;
&amp;lt;/ItemDef&amp;gt;&lt;/pre&gt;

  &lt;p&gt;Notice that the Description element does the work. The DocumentRef points a reviewer to the formal specs. But the description alone is enough for a reviewer to validate the derivation without opening anything. That is the standard to aim for.&lt;/p&gt;

  &lt;h3&gt;5.4 The Structural Variable Origin Problem&lt;/h3&gt;

  &lt;p&gt;RDOMAIN, IDVAR, IDVARVAL, QLABEL — these structural SUPPQUAL variables trip up many define packages. Their origin is technically Assigned or Derived in most implementations, because they are constructed by the programmer rather than collected on a CRF. But they are not user-facing clinical data in the same way QVAL is. Many programmers leave them as Predecessor or assign them a generic &quot;Assigned&quot; without further documentation.&lt;/p&gt;

  &lt;p&gt;The FDA TCG expectation is that IDVAR and IDVARVAL have clear origin documentation that explains which variable in the parent domain they reference. A note in the Description element for IDVAR stating &quot;Contains the name of the key variable in the parent AE domain used to link supplemental records. Populated with &#39;AESEQ&#39;&quot; is significantly better than a bare Assigned origin.&lt;/p&gt;

  &lt;!-- ============================================================ --&gt;
  &lt;hr class=&quot;section-divider&quot;&gt;
  &lt;h2 id=&quot;controlled-terms&quot;&gt;6. Controlled Terminology in SUPPQUAL QVAL — Who Owns the Codelist?&lt;/h2&gt;

  &lt;p&gt;When a QNAM takes values from a controlled terminology, the VLM-level ItemDef for that QNAM should carry a &lt;code&gt;CodeListRef&lt;/code&gt;. The question is: which codelist, and where is it defined?&lt;/p&gt;

  &lt;h3&gt;6.1 CDISC Standard Codelists&lt;/h3&gt;

  &lt;p&gt;Most Y/N flags in SUPPQUAL map to the CDISC NY codelist. Reference it exactly as you would from any other domain. The CodeListDef for NY goes in the global CodeLists section of your define.xml, not inside the SUPPQUAL section.&lt;/p&gt;

&lt;pre&gt;&amp;lt;!-- In the CodeLists section of define.xml --&amp;gt;
&amp;lt;CodeList OID=&quot;CL.NY&quot;
          Name=&quot;No Yes Response&quot;
          DataType=&quot;text&quot;
          def:StandardOID=&quot;STD.CT.2024-09-27&quot;&amp;gt;
  &amp;lt;ExternalCodeList
    Dictionary=&quot;NCI&quot;
    Version=&quot;2024-09-27&quot;
    ref=&quot;C66742&quot;/&amp;gt;
&amp;lt;/CodeList&amp;gt;&lt;/pre&gt;

&lt;pre&gt;&amp;lt;!-- In the VLM ItemDef --&amp;gt;
&amp;lt;ItemDef OID=&quot;IT.SUPPAE.QVAL.AESLIFE&quot; ...&amp;gt;
  ...
  &amp;lt;CodeListRef CodeListOID=&quot;CL.NY&quot;/&amp;gt;
&amp;lt;/ItemDef&amp;gt;&lt;/pre&gt;

  &lt;h3&gt;6.2 Sponsor-Defined Codelists for SUPPQUAL&lt;/h3&gt;

  &lt;p&gt;Some QNAMs have permissible values that are sponsor-defined — not from CDISC terminology. A common example is a dose escalation category or a protocol-specific severity classification that appears as a supplemental qualifier. For these, you define a local CodeList with a clear naming convention and mark it appropriately.&lt;/p&gt;

&lt;pre&gt;&amp;lt;CodeList OID=&quot;CL.SUPPAE.AESCATYP&quot;
          Name=&quot;AE Categorization Type&quot;
          DataType=&quot;text&quot;&amp;gt;
  &amp;lt;EnumeratedItem CodedValue=&quot;INFUSION RELATED&quot;
                  def:ExtendedValue=&quot;No&quot;/&amp;gt;
  &amp;lt;EnumeratedItem CodedValue=&quot;HYPERSENSITIVITY&quot;
                  def:ExtendedValue=&quot;No&quot;/&amp;gt;
  &amp;lt;EnumeratedItem CodedValue=&quot;CRS&quot;
                  def:ExtendedValue=&quot;No&quot;/&amp;gt;
&amp;lt;/CodeList&amp;gt;&lt;/pre&gt;

  &lt;div class=&quot;note&quot;&gt;
    For sponsor-defined codelists, use &lt;code&gt;def:ExtendedValue=&quot;No&quot;&lt;/code&gt; on each EnumeratedItem to indicate these are the complete permissible values and not extensions of an external dictionary. If your codelist extends a CDISC codelist by adding sponsor-specific terms, use &lt;code&gt;def:ExtendedValue=&quot;Yes&quot;&lt;/code&gt; on the added items and reference the parent standard codelist via ExternalCodeList.
  &lt;/div&gt;

  &lt;h3&gt;6.3 QNAMs That Should Not Have a Codelist&lt;/h3&gt;

  &lt;p&gt;Free-text QNAMs — verbatim descriptions, reason fields, comment fields — must not carry a CodeListRef. This sounds obvious, but a common error occurs when a programmer builds a template from an existing QNAM that does have a codelist and forgets to strip the CodeListRef when adding a free-text QNAM. The result is a define.xml that claims a free-text field has controlled permissible values, which Pinnacle 21 will flag as a terminology inconsistency and which FDA reviewers will question.&lt;/p&gt;

  &lt;!-- ============================================================ --&gt;
  &lt;hr class=&quot;section-divider&quot;&gt;
  &lt;h2 id=&quot;rejection-patterns&quot;&gt;7. Common Submission Rejection Patterns&lt;/h2&gt;

  &lt;p&gt;These are patterns drawn from actual FDA and EMA reviewer feedback letters and study data validation report (SDVR) findings. They cluster into six categories.&lt;/p&gt;

  &lt;h3&gt;7.1 Missing VLM for One or More QNAMs&lt;/h3&gt;

  &lt;div class=&quot;rejection-box&quot;&gt;
    &lt;strong&gt;Rejection Pattern: &quot;Value-level metadata not provided for QNAM values [list]. Each unique QNAM in SUPP-- datasets must have corresponding VLM entries in define.xml.&quot;&lt;/strong&gt;
    This is the most frequent finding. It usually happens when a QNAM is added late in the study lifecycle — a new flag requested by biostatistics after the define package was already built — and the VLM entry is either forgotten or added only to the dataset without updating define.xml.
  &lt;/div&gt;

  &lt;p&gt;Prevention: Your define.xml build process should include a programmatic check that compares the unique QNAM values in each production SUPP XPT against the WhereClauseDef CheckValues in the corresponding ValueListDef. Any QNAM in the dataset with no matching WhereClause is a define gap. Run this check as part of your define QC program, not as a manual step.&lt;/p&gt;

&lt;pre&gt;/* SAS: Check for QNAMs missing from VLM */
/* Assumes you have parsed define.xml WhereClause values into
   a dataset called vlm_qnams with fields: domain, qnam_value */

proc sql;
  create table missing_vlm as
  select distinct a.rdomain,
                  a.qnam,
                  &quot;No VLM entry in define.xml&quot; as issue
  from suppae a
  left join vlm_qnams b
    on upcase(a.qnam) = upcase(b.qnam_value)
    and b.domain = &#39;SUPPAE&#39;
  where b.qnam_value is null;
quit;

proc print data=missing_vlm noobs; run;&lt;/pre&gt;

  &lt;h3&gt;7.2 QNAM Values in Dataset Do Not Match CheckValue in WhereClauseDef&lt;/h3&gt;

  &lt;div class=&quot;rejection-box&quot;&gt;
    &lt;strong&gt;Rejection Pattern: &quot;WhereClause condition for [DOMAIN].QNAM EQ [value] does not match observed QNAM values in dataset. Observed: [AEOSPTA], WhereClause: [AEOSTPA].&quot;&lt;/strong&gt;
    This is a typo problem, pure and simple. The QNAM name in the CheckValue element is not identical to the QNAM string in the XPT. Usually a transposition error or a case mismatch discovered after the fact.
  &lt;/div&gt;

  &lt;p&gt;Prevention: Never hand-type QNAM values into WhereClauseDef CheckValue elements. Generate them programmatically from the dataset itself, or at minimum diff your define.xml CheckValues against a proc freq output of the actual dataset QNAMs.&lt;/p&gt;

  &lt;h3&gt;7.3 Origin Type &quot;CRF&quot; with No PDFPageRef or Broken LeafID&lt;/h3&gt;

  &lt;div class=&quot;rejection-box&quot;&gt;
    &lt;strong&gt;Rejection Pattern: &quot;Origin Type=CRF specified for [QNAM] but no CRF page reference provided. Unable to locate source question in annotated CRF.&quot;&lt;/strong&gt;
    The origin says CRF but either the DocumentRef is missing, the leafID does not resolve, or the PDFPageRef points to a named destination that does not exist in the annotated CRF PDF.
  &lt;/div&gt;

  &lt;p&gt;Prevention: Validate all leafIDs against actual files present in the submission package. Validate all NamedDestination values against the bookmarks/destinations actually present in the CRF PDF. Both are scriptable checks — PDF bookmark extraction via Python or a SAS DDE/shell call is straightforward and should be part of your define QC process.&lt;/p&gt;

  &lt;h3&gt;7.4 Derived QNAMs with No Explanation of Derivation Logic&lt;/h3&gt;

  &lt;div class=&quot;rejection-box&quot;&gt;
    &lt;strong&gt;Rejection Pattern: &quot;Origin Type=Derived specified for [QNAM] but derivation logic not documented in define.xml or referenced specifications. Please provide the algorithm used to derive this variable.&quot;&lt;/strong&gt;
    The define says Derived, the DocumentRef points to a general specs document, but there is no specific derivation description anywhere accessible to the reviewer.
  &lt;/div&gt;

  &lt;p&gt;Prevention: For every Derived QNAM, the Description element on the VLM ItemDef should contain enough information that a reviewer can understand the derivation method without opening a separate document. The DocumentRef is supplementary, not a replacement for the in-line description.&lt;/p&gt;

  &lt;h3&gt;7.5 Inconsistent Length Between Column-Level and VLM-Level ItemDef&lt;/h3&gt;

  &lt;div class=&quot;rejection-box&quot;&gt;
    &lt;strong&gt;Rejection Pattern: &quot;VLM-level Length for QNAM=[value] exceeds column-level Length for QVAL. VLM Length should not exceed parent column length.&quot;&lt;/strong&gt;
    The column-level QVAL ItemDef declares Length=200, but a specific VLM ItemDef for one QNAM declares Length=250. This is internally inconsistent — a VLM entry cannot describe data that is longer than the column that contains it.
  &lt;/div&gt;

  &lt;p&gt;Prevention: VLM-level lengths should always be less than or equal to the column-level length for QVAL. In practice, column-level QVAL length should match the XPT variable length, and VLM lengths should reflect the actual maximum observed length for each QNAM&#39;s values. Run &lt;code&gt;proc contents&lt;/code&gt; to confirm the declared QVAL variable length, and a &lt;code&gt;proc sql&lt;/code&gt; query computing &lt;code&gt;max(length(qval))&lt;/code&gt; grouped by QNAM (as shown below) to determine appropriate VLM lengths.&lt;/p&gt;

&lt;pre&gt;/* Get max QVAL length by QNAM for length documentation */
proc sql;
  create table qval_lengths as
  select rdomain,
         qnam,
         max(length(strip(qval))) as max_qval_length
  from suppae
  group by rdomain, qnam
  order by rdomain, qnam;
quit;

proc print data=qval_lengths noobs label;
  label max_qval_length = &quot;Max QVAL Length&quot;;
run;&lt;/pre&gt;

  &lt;h3&gt;7.6 QLABEL in Define.xml Does Not Match Actual QLABEL Values in Dataset&lt;/h3&gt;

  &lt;div class=&quot;rejection-box&quot;&gt;
    &lt;strong&gt;Rejection Pattern: &quot;Label documented in define.xml Description element for QNAM=[value] does not match QLABEL observed in dataset. Define: &#39;Life Threatening Event&#39;, Dataset: &#39;Life-Threatening&#39;.&quot;&lt;/strong&gt;
    QLABEL is a character variable in the SUPPQUAL dataset. Its value must be consistent across all rows for a given QNAM, and the Description on the corresponding VLM ItemDef must reflect this label accurately.
  &lt;/div&gt;

  &lt;p&gt;The QLABEL in your dataset is the authoritative source. Your define.xml Description element for each QNAM VLM entry should use the same phrasing. If QLABEL=&quot;Life-Threatening&quot; in the dataset, the VLM Description should say &quot;Life-Threatening&quot; in its label description, not a longer or differently punctuated form. FDA reviewers do exact text comparisons between define.xml and dataset values in SDVR tooling.&lt;/p&gt;

  &lt;!-- ============================================================ --&gt;
  &lt;hr class=&quot;section-divider&quot;&gt;
  &lt;h2 id=&quot;pmda&quot;&gt;8. PMDA-Specific Considerations&lt;/h2&gt;

  &lt;p&gt;PMDA submissions add layers beyond FDA requirements. If you are delivering a Japan package, several SUPPQUAL define.xml behaviors require specific attention.&lt;/p&gt;

  &lt;h3&gt;8.1 Bilingual Descriptions&lt;/h3&gt;

  &lt;p&gt;PMDA increasingly expects Japanese-language TranslatedText elements alongside English descriptions, particularly for VLM ItemDefs in SUPPQUAL. This applies to the Description element. Using a single &lt;code&gt;xml:lang=&quot;en&quot;&lt;/code&gt; element is technically valid per the schema but draws reviewer queries in Japan submissions. The correct pattern:&lt;/p&gt;

&lt;pre&gt;&amp;lt;Description&amp;gt;
  &amp;lt;TranslatedText xml:lang=&quot;en&quot;&amp;gt;
    Indicator of whether the adverse event was life-threatening.
  &amp;lt;/TranslatedText&amp;gt;
  &amp;lt;TranslatedText xml:lang=&quot;ja&quot;&amp;gt;
    有害事象が生命を脅かすものであったかどうかを示す指標。
  &amp;lt;/TranslatedText&amp;gt;
&amp;lt;/Description&amp;gt;&lt;/pre&gt;

  &lt;p&gt;If you do not have translation resources, at minimum ensure the English description is precise enough that a Japanese reviewer using machine translation can derive accurate meaning. Ambiguous English descriptions compounded by imperfect machine translation is a known source of PMDA reviewer queries.&lt;/p&gt;

  &lt;h3&gt;8.2 PMDA Requires QNAM-Level Variable Metadata in the Data Definition Document&lt;/h3&gt;

  &lt;p&gt;PMDA validation checklists specifically call out that SUPPQUAL QNAM values should be documented as variables in the data definition document (essentially define.xml) with labels and derivation rules equivalent to how named variables are documented in non-SUPP domains. Their reviewers check this against the SDTM datasets using their own tooling.&lt;/p&gt;

  &lt;h3&gt;8.3 Encoding and Character Width in QLABEL&lt;/h3&gt;

  &lt;p&gt;PMDA submissions often involve Japanese character data in non-SUPP domains but QLABEL in SUPPQUAL is almost always ASCII English. However, if your submission includes Japanese QLABEL values, verify that the XPT character encoding documentation in define.xml (via the &lt;code&gt;def:CommentDef&lt;/code&gt; mechanism or a dedicated annotation) explicitly acknowledges the encoding. PMDA has flagged submissions where the define.xml implied ASCII-only encoding but the dataset contained multi-byte characters.&lt;/p&gt;

  &lt;!-- ============================================================ --&gt;
  &lt;hr class=&quot;section-divider&quot;&gt;
  &lt;h2 id=&quot;sas&quot;&gt;9. SAS Utility: Generating VLM Entries Programmatically&lt;/h2&gt;

  &lt;p&gt;Manually authoring VLM entries for large SUPPQUAL domains — SUPPCM with 20+ QNAMs, SUPPLB with method and specimen qualifiers — is error-prone and time-consuming. Build a generation utility that takes a metadata specs dataset as input and outputs the XML fragments for ValueListDef, WhereClauseDef, and ItemDef elements.&lt;/p&gt;

  &lt;h3&gt;9.1 Input Metadata Dataset Structure&lt;/h3&gt;

&lt;pre&gt;/* Define the metadata specs dataset for SUPPQUAL VLM generation */
/* One row per unique QNAM per SUPPQUAL domain */

data suppqual_vlm_specs;
  length domain    $8
         qnam      $8
         qlabel    $40
         datatype  $10
         length_    8
         origin    $20
         codelist  $40
         crf_dest  $80
         derivation_text $500;
  infile cards dsd;
  input domain $ qnam $ qlabel $ datatype $ length_
        origin $ codelist $ crf_dest $ derivation_text $;
cards;
SUPPAE,AESLIFE,Life-Threatening,text,1,CRF,CL.NY,AE_SAE_PAGE,.
SUPPAE,AECONTRT,Concomitant Treatment Given,text,1,CRF,CL.NY,AE_DETAILS_PAGE,.
SUPPAE,AOCCIFL,Any Occurrence Indicator Flag,text,1,Derived,CL.NY,.,Flag for first occurrence by PT within subject based on AESTDTC ascending
SUPPAE,AERELNST,Relationship to Non-Study Therapy,text,50,CRF,.,AE_RELATIONSHIP_PAGE,.
;
run;&lt;/pre&gt;

  &lt;h3&gt;9.2 XML Generation Macro&lt;/h3&gt;

&lt;pre&gt;%macro gen_supp_vlm(domain=, specs_ds=, outfile=);

  /* Step 1: Get unique QNAMs ordered by sequence */
  proc sort data=&amp;amp;specs_ds.(where=(domain=&quot;&amp;amp;domain.&quot;))
            out=_specs;
    by domain qnam;
  run;

  filename vlm_out &quot;&amp;amp;outfile.&quot;;
  data _null_;
    file vlm_out lrecl=32767;

    /* ValueListDef opening tag -- write once, on the first iteration only.
       Without the _N_ guard this PUT executes on every pass through the
       data step (it sits above the SET) and repeats the tag per QNAM. */
    if _n_ = 1 then
      put &quot;&amp;lt;def:ValueListDef OID=&quot;&quot;VL.&amp;amp;domain..QVAL&quot;&quot;&amp;gt;&quot;;

    set _specs end=last;
    by domain;

    seq + 1;

    /* ItemRef within ValueListDef */
    put &#39;  &amp;lt;ItemRef ItemOID=&quot;IT.&#39; domain +(-1) &#39;.QVAL.&#39; qnam +(-1) &#39;&quot;&#39;;
    put &#39;           OrderNumber=&quot;&#39; seq +(-1) &#39;&quot;&#39;;
    put &#39;           Mandatory=&quot;Yes&quot;&amp;gt;&#39;;
    put &#39;    &amp;lt;def:WhereClauseRef WhereClauseOID=&quot;WC.&#39; domain +(-1) &#39;.QNAM.&#39; qnam +(-1) &#39;&quot;/&amp;gt;&#39;;
    put &#39;  &amp;lt;/ItemRef&amp;gt;&#39;;

    if last then put &quot;&amp;lt;/def:ValueListDef&amp;gt;&quot;;
  run;

  /* Step 2: WhereClauseDefs */
  data _null_;
    file vlm_out lrecl=32767 mod;
    set _specs;

    put &#39;&amp;lt;def:WhereClauseDef OID=&quot;WC.&#39; domain +(-1) &#39;.QNAM.&#39; qnam +(-1) &#39;&quot;&amp;gt;&#39;;
    put &#39;  &amp;lt;RangeCheck Comparator=&quot;EQ&quot; SoftHard=&quot;Soft&quot;&#39;;
    put &#39;              def:ItemOID=&quot;IT.&#39; domain +(-1) &#39;.QNAM&quot;&amp;gt;&#39;;
    put &#39;    &amp;lt;CheckValue&amp;gt;&#39; qnam +(-1) &#39;&amp;lt;/CheckValue&amp;gt;&#39;;
    put &#39;  &amp;lt;/RangeCheck&amp;gt;&#39;;
    put &#39;&amp;lt;/def:WhereClauseDef&amp;gt;&#39;;
  run;

  /* Step 3: VLM-level ItemDefs */
  data _null_;
    file vlm_out lrecl=32767 mod;
    set _specs;

    put &#39;&amp;lt;ItemDef OID=&quot;IT.&#39; domain +(-1) &#39;.QVAL.&#39; qnam +(-1) &#39;&quot;&#39;;
    put &#39;         Name=&quot;&#39; qnam +(-1) &#39;&quot;&#39;;
    put &#39;         DataType=&quot;&#39; datatype +(-1) &#39;&quot;&#39;;
    put &#39;         Length=&quot;&#39; length_ +(-1) &#39;&quot;&#39;;
    put &#39;         SASFieldName=&quot;QVAL&quot;&amp;gt;&#39;;
    put &#39;  &amp;lt;Description&amp;gt;&#39;;
    put &#39;    &amp;lt;TranslatedText xml:lang=&quot;en&quot;&amp;gt;&#39; qlabel +(-1) &#39;&amp;lt;/TranslatedText&amp;gt;&#39;;
    put &#39;  &amp;lt;/Description&amp;gt;&#39;;

    if codelist ne &#39;.&#39; then
      put &#39;  &amp;lt;CodeListRef CodeListOID=&quot;&#39; codelist +(-1) &#39;&quot;/&amp;gt;&#39;;

    if origin = &#39;CRF&#39; then do;
      put &#39;  &amp;lt;def:Origin Type=&quot;CRF&quot;&amp;gt;&#39;;
      put &#39;    &amp;lt;def:DocumentRef leafID=&quot;LF.ACRF&quot;&amp;gt;&#39;;
      put &#39;      &amp;lt;def:PDFPageRef Type=&quot;NamedDestination&quot; PageRefs=&quot;&#39;
          crf_dest +(-1) &#39;&quot;/&amp;gt;&#39;;
      put &#39;    &amp;lt;/def:DocumentRef&amp;gt;&#39;;
      put &#39;  &amp;lt;/def:Origin&amp;gt;&#39;;
    end;
    else if origin = &#39;Derived&#39; then do;
      put &#39;  &amp;lt;def:Origin Type=&quot;Derived&quot;&amp;gt;&#39;;
      put &#39;    &amp;lt;def:DocumentRef leafID=&quot;LF.SDTM_SPECS&quot;/&amp;gt;&#39;;
      put &#39;  &amp;lt;/def:Origin&amp;gt;&#39;;
    end;

    put &#39;&amp;lt;/ItemDef&amp;gt;&#39;;
  run;

  filename vlm_out clear;
%mend gen_supp_vlm;

/* Usage */
%gen_supp_vlm(
  domain   = SUPPAE,
  specs_ds = suppqual_vlm_specs,
  outfile  = /path/to/suppae_vlm_fragments.xml
);&lt;/pre&gt;

  &lt;p&gt;This is a skeleton macro. In production, extend it to handle: multi-part derivation text with proper XML escaping, structured DocumentRef with PDFPageRef per QNAM rather than a generic leaf, and character escaping for XML special characters in description text (&lt;code&gt;&amp;amp;&lt;/code&gt;, &lt;code&gt;&amp;lt;&lt;/code&gt;, &lt;code&gt;&amp;gt;&lt;/code&gt;). Use &lt;code&gt;tranwrd()&lt;/code&gt; chains or a dedicated XML-escape function before writing to file.&lt;/p&gt;

  &lt;h3&gt;9.3 Validating the Output&lt;/h3&gt;

  &lt;p&gt;After generating XML fragments and integrating them into your define.xml, validate using at minimum two tools: Pinnacle 21 Community Edition (or Enterprise if your organization has it) and the CDISC Define-XML validator at define.cdisc.org. These tools catch different classes of errors. Pinnacle 21 focuses on clinical data consistency; the CDISC validator focuses on schema conformance. Run both before any submission.&lt;/p&gt;

  &lt;!-- ============================================================ --&gt;
  &lt;hr class=&quot;section-divider&quot;&gt;
  &lt;h2 id=&quot;checklist&quot;&gt;10. Pre-Submission Checklist for SUPPQUAL Define.xml&lt;/h2&gt;

  &lt;p&gt;Before any define package leaves your desk for submission, verify each of the following. This is not a generic checklist — every item here maps directly to a rejection pattern seen in actual submissions.&lt;/p&gt;

  &lt;table&gt;
    &lt;thead&gt;
      &lt;tr&gt;
        &lt;th style=&quot;width:5%&quot;&gt;#&lt;/th&gt;
        &lt;th style=&quot;width:55%&quot;&gt;Check&lt;/th&gt;
        &lt;th style=&quot;width:40%&quot;&gt;How to Verify&lt;/th&gt;
      &lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody&gt;
      &lt;tr&gt;
        &lt;td&gt;1&lt;/td&gt;
        &lt;td&gt;Every unique QNAM in the XPT has a corresponding WhereClauseDef with matching CheckValue (exact case, exact string)&lt;/td&gt;
        &lt;td&gt;Programmatic diff of proc freq(QNAM) vs CheckValue elements in define.xml&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;2&lt;/td&gt;
        &lt;td&gt;Every QNAM WhereClauseRef resolves to a defined WhereClauseDef OID&lt;/td&gt;
        &lt;td&gt;XML validation; Pinnacle 21 will flag unresolved OIDs&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;3&lt;/td&gt;
        &lt;td&gt;VLM QVAL ItemDef length ≤ column-level QVAL ItemDef length ≤ XPT QVAL variable length&lt;/td&gt;
        &lt;td&gt;Compare proc contents length output against define lengths&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;4&lt;/td&gt;
        &lt;td&gt;All CRF-origin QNAM entries have PDFPageRef with a NamedDestination that exists in the annotated CRF PDF&lt;/td&gt;
        &lt;td&gt;Extract PDF named destinations programmatically and cross-check&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;5&lt;/td&gt;
        &lt;td&gt;All Derived QNAM entries have a Description that explains the derivation in plain language&lt;/td&gt;
        &lt;td&gt;Manual review of each Derived VLM ItemDef Description element&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;6&lt;/td&gt;
        &lt;td&gt;No CodeListRef on free-text QNAM entries&lt;/td&gt;
        &lt;td&gt;Review all VLM ItemDefs; confirm CodeListRef absent for free-text QNAMs&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;7&lt;/td&gt;
        &lt;td&gt;Description text for each QNAM matches the QLABEL value used in the dataset&lt;/td&gt;
        &lt;td&gt;Compare proc freq QLABEL output against VLM Description TranslatedText values&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;8&lt;/td&gt;
        &lt;td&gt;All leaf IDs referenced in DocumentRef elements resolve to actual files in the submission package&lt;/td&gt;
        &lt;td&gt;Cross-reference all leafID attributes against submission folder contents&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;9&lt;/td&gt;
        &lt;td&gt;SoftHard=&quot;Soft&quot; on all SUPPQUAL WhereClause RangeCheck elements&lt;/td&gt;
        &lt;td&gt;grep or XPath search for SoftHard in define.xml&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;10&lt;/td&gt;
        &lt;td&gt;WhereClauseDef OIDs are domain-scoped (no shared OIDs across SUPPQUAL domains)&lt;/td&gt;
        &lt;td&gt;Confirm OID naming convention includes domain prefix; check for duplicate OIDs in full define.xml&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;11&lt;/td&gt;
        &lt;td&gt;QVAL column-level ItemDef has def:ValueListRef attribute pointing to the correct ValueListDef OID&lt;/td&gt;
        &lt;td&gt;Direct inspection of QVAL ItemDef element in define.xml XML source&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;12&lt;/td&gt;
        &lt;td&gt;def:KeySequence=&quot;1&quot; on the QNAM ItemRef within the SUPPQUAL ItemGroupDef&lt;/td&gt;
        &lt;td&gt;Direct inspection of ItemGroupDef structure in define.xml XML source&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;13&lt;/td&gt;
        &lt;td&gt;Cross-domain consistency checks implemented (SUPP vs parent domain) — e.g., AESLIFE=Y where AESER=N&lt;/td&gt;
        &lt;td&gt;Custom SAS QC program — not covered by Pinnacle 21&lt;/td&gt;
      &lt;/tr&gt;
    &lt;/tbody&gt;
  &lt;/table&gt;

  &lt;p&gt;Run Pinnacle 21 after completing every item on this list, not before. Pinnacle 21 is a final gate, not a substitute for structured pre-review. It catches some things this list misses and misses some things this list catches. Both layers are necessary.&lt;/p&gt;

  &lt;!-- ============================================================ --&gt;
  &lt;hr class=&quot;section-divider&quot;&gt;
  &lt;h2 id=&quot;idvar&quot;&gt;11. IDVAR / IDVARVAL — The Hidden Failure Point&lt;/h2&gt;

  &lt;p&gt;Most define.xml discussions focus on QNAM and QVAL. Reviewers do not stop there — they struggle just as much with linkage. SUPPQUAL is only interpretable if the reviewer can answer one question: &lt;em&gt;which parent record does this qualifier belong to?&lt;/em&gt; That answer depends entirely on RDOMAIN, IDVAR, and IDVARVAL working together — and all three need explicit metadata support in define.xml.&lt;/p&gt;

  &lt;h3&gt;11.1 What IDVAR Actually Represents&lt;/h3&gt;

  &lt;p&gt;IDVAR is not just a variable name string. It defines the linking key into the parent domain. When RDOMAIN=AE, IDVAR=AESEQ, and IDVARVAL=12, the SUPPAE record links to the AE record where AESEQ=12 for the same USUBJID. That linkage must be unambiguous. If a reviewer cannot confirm uniqueness of the parent record — because AESEQ is not unique, or because IDVAR points to a variable that admits duplicates — the entire SUPPQUAL interpretation collapses.&lt;/p&gt;

  &lt;h3&gt;11.2 Why Ambiguity Here Breaks Review&lt;/h3&gt;

  &lt;p&gt;Consider a package where IDVAR=AESEQ and IDVARVAL=12 appears in a SUPPAE record. On the surface this is correct. But if the reviewer looks at the AE domain and finds three records with AESEQ=12 — a well-known error pattern in datasets with improper sequence numbering — they cannot resolve which parent record the qualifier belongs to. No validator catches this. The SUPPQUAL structure is internally consistent. The linkage is semantically broken.&lt;/p&gt;

  &lt;div class=&quot;warn&quot;&gt;
    &lt;strong&gt;If a reviewer cannot trace a SUPPQUAL row to exactly one parent record in under ten seconds, your define.xml is incomplete.&lt;/strong&gt; The metadata should pre-empt the question, not leave the reviewer reverse-engineering your data.
  &lt;/div&gt;

  &lt;div class=&quot;rejection-box&quot;&gt;
    &lt;strong&gt;Real Review Failure:&lt;/strong&gt;
    &quot;Multiple AE records share AESEQ=12 for subject 101-001. Unable to determine which AE record the SUPPAE qualifier applies to. Please confirm whether AESEQ is unique within USUBJID and provide corrected linkage documentation.&quot;
  &lt;/div&gt;

  &lt;h3&gt;11.3 What Define.xml Should Make Clear&lt;/h3&gt;

  &lt;p&gt;The Description elements for IDVAR and IDVARVAL are almost always boilerplate or blank in real submissions. They should not be. At minimum:&lt;/p&gt;

&lt;pre&gt;&amp;lt;!-- IDVAR ItemDef description --&amp;gt;
&amp;lt;Description&amp;gt;
  &amp;lt;TranslatedText xml:lang=&quot;en&quot;&amp;gt;
    Identifies the key variable in the parent AE domain used to link
    supplemental qualifier records. Populated with AESEQ. The combination
    of USUBJID and IDVARVAL uniquely identifies one AE record.
  &amp;lt;/TranslatedText&amp;gt;
&amp;lt;/Description&amp;gt;

&amp;lt;!-- IDVARVAL ItemDef description --&amp;gt;
&amp;lt;Description&amp;gt;
  &amp;lt;TranslatedText xml:lang=&quot;en&quot;&amp;gt;
    Value of AESEQ identifying the parent AE record for this qualifier.
    1:1 correspondence with AE.AESEQ; no parent record has more than one
    SUPPAE record per QNAM.
  &amp;lt;/TranslatedText&amp;gt;
&amp;lt;/Description&amp;gt;&lt;/pre&gt;

  &lt;p&gt;The phrase &quot;uniquely identifies one AE record&quot; is doing real work here. It tells the reviewer the mapping is deterministic and they do not need to investigate further. That is what good metadata does — it removes the reviewer&#39;s doubt before the doubt forms.&lt;/p&gt;

  &lt;h3&gt;11.4 When Linkage Is Non-Standard&lt;/h3&gt;

  &lt;p&gt;Non-standard linkage arises in rescue mapping, merged domain scenarios, and certain oncology or endpoint-heavy programs. If IDVARVAL is derived from a concatenation, a hash, or a composite of multiple parent variables, you must say so explicitly — both in the IDVARVAL Description element and, if the derivation is complex, in a referenced specifications document.&lt;/p&gt;

&lt;pre&gt;&amp;lt;!-- Non-standard linkage: composite key --&amp;gt;
&amp;lt;Description&amp;gt;
  &amp;lt;TranslatedText xml:lang=&quot;en&quot;&amp;gt;
    IDVARVAL derived from concatenation of AESEQ (zero-padded to 4 digits)
    and VISITNUM to ensure uniqueness across repeated events at the same
    visit. Format: AESEQ_VISITNUM (e.g., &quot;0012_3&quot;). This composite key
    resolves to exactly one AE record per USUBJID.
  &amp;lt;/TranslatedText&amp;gt;
&amp;lt;/Description&amp;gt;&lt;/pre&gt;

  &lt;p&gt;If you cannot write a clear derivation description for IDVARVAL, that is a signal the linkage design itself needs to be revisited before the define.xml is written.&lt;/p&gt;

  &lt;h4&gt;Quick Validation Check&lt;/h4&gt;

  &lt;p&gt;Before documenting IDVARVAL linkage as 1:1, verify it programmatically. This check should run as part of your standard define QC program — not as a one-off manual step.&lt;/p&gt;

&lt;pre&gt;/* Check uniqueness of parent linkage key within USUBJID */
/* Run against the parent domain BEFORE writing IDVAR metadata */
proc sql;
  select usubjid,
         aeseq,
         count(*) as n
  from ae
  group by usubjid, aeseq
  having calculated n &gt; 1;
quit;

/* If any rows return: your IDVARVAL linkage is non-unique.
   Do NOT document as 1:1 in define.xml until this is resolved.
   Investigate whether AESEQ was reset across visits or periods. */&lt;/pre&gt;

  &lt;!-- ============================================================ --&gt;
  &lt;hr class=&quot;section-divider&quot;&gt;
  &lt;h2 id=&quot;design-decision&quot;&gt;12. SUPPQUAL vs Custom Domain — Design Decision&lt;/h2&gt;

  &lt;p&gt;Every section so far assumes the decision to use SUPPQUAL has already been made. That assumption is dangerous. Using SUPPQUAL for data that belongs in a named domain — or in a custom domain — creates bloated define.xml, unreadable VLM, and reviewer confusion that define.xml cannot fix after the fact. The metadata problem is downstream of a design problem.&lt;/p&gt;

  &lt;h3&gt;12.1 The Decision Framework&lt;/h3&gt;

  &lt;table&gt;
    &lt;thead&gt;
      &lt;tr&gt;&lt;th&gt;Situation&lt;/th&gt;&lt;th&gt;Better Choice&lt;/th&gt;&lt;th&gt;Why&lt;/th&gt;&lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody&gt;
      &lt;tr&gt;
        &lt;td&gt;Qualifier repeats across records in a structured way (multiple specimens, multiple methods)&lt;/td&gt;
        &lt;td&gt;New SDTM domain or RELREC-linked structure&lt;/td&gt;
        &lt;td&gt;Repeated structure in SUPPQUAL creates parallel rows with no obvious grouping anchor&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;Variable is critical to analysis or is referenced in TFLs&lt;/td&gt;
        &lt;td&gt;Named variable in the parent domain&lt;/td&gt;
        &lt;td&gt;Analysis variables buried in SUPPQUAL require merging before use; reviewers expect key variables accessible directly&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;Genuinely one-off qualifier, not repeated, not critical to analysis&lt;/td&gt;
        &lt;td&gt;SUPPQUAL&lt;/td&gt;
        &lt;td&gt;This is the use case SUPPQUAL was designed for&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;Qualifier is sponsor-defined and used only for internal tracking&lt;/td&gt;
        &lt;td&gt;SUPPQUAL with clear Assigned origin&lt;/td&gt;
        &lt;td&gt;Acceptable if clearly documented; do not conflate with analysis variables&lt;/td&gt;
      &lt;/tr&gt;
    &lt;/tbody&gt;
  &lt;/table&gt;

  &lt;h3&gt;12.2 Warning Signs You Chose Wrong&lt;/h3&gt;

  &lt;p&gt;If any of the following are true for your SUPPQUAL domain, the design decision deserves a second look before you invest time in VLM construction:&lt;/p&gt;

  &lt;p&gt;QNAM count above 25 to 30 in a single SUPPQUAL dataset. The same concept split across multiple QNAMs with a numeric suffix (FLAG1, FLAG2, FLAG3). Heavy derivation logic inside SUPPQUAL that would be simpler to express as a named derived variable. Reviewers routinely needing to merge SUPPQUAL back to the parent domain to understand the parent domain records.&lt;/p&gt;

  &lt;div class=&quot;note&quot;&gt;
    When SUPPQUAL is acting like a domain, it should be a domain. The define.xml burden of a 40-QNAM SUPPLB is an order of magnitude higher than a clean 10-variable LB extension domain, and the reviewer experience is worse in every dimension.
  &lt;/div&gt;

  &lt;!-- ============================================================ --&gt;
  &lt;hr class=&quot;section-divider&quot;&gt;
  &lt;h2 id=&quot;p21-limits&quot;&gt;13. What Pinnacle 21 Will NOT Catch&lt;/h2&gt;

  &lt;p&gt;Pinnacle 21 validates structure. Reviewers validate meaning. These are not the same activity, and conflating them is one of the most expensive mistakes a define.xml team can make.&lt;/p&gt;

  &lt;h3&gt;13.1 The Gap Between Structural Validity and Review Readiness&lt;/h3&gt;

  &lt;p&gt;A define.xml can pass every Pinnacle 21 rule and still be useless to a reviewer. Here is the category of failures P21 cannot detect:&lt;/p&gt;

  &lt;table&gt;
    &lt;thead&gt;
      &lt;tr&gt;&lt;th&gt;Failure Type&lt;/th&gt;&lt;th&gt;P21 Response&lt;/th&gt;&lt;th&gt;Reviewer Response&lt;/th&gt;&lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody&gt;
      &lt;tr&gt;
        &lt;td&gt;Vague Description element (&quot;Flag&quot; as the entire description of AOCCIFL)&lt;/td&gt;
        &lt;td&gt;✅ Pass — Description element is present&lt;/td&gt;
        &lt;td&gt;❌ Query — &quot;Please provide the derivation algorithm for this flag&quot;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;CRF origin with a valid leafID that points to the wrong CRF page&lt;/td&gt;
        &lt;td&gt;✅ Pass — leafID resolves, PDFPageRef is syntactically valid&lt;/td&gt;
        &lt;td&gt;❌ Query — &quot;CRF page referenced does not contain the field described&quot;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;QNAM name that is cryptic or non-intuitive (e.g., QNAM=XCFL3)&lt;/td&gt;
        &lt;td&gt;✅ Pass — QNAM ≤ 8 characters, conforms to naming rules&lt;/td&gt;
        &lt;td&gt;❌ Query — &quot;Please clarify the meaning of XCFL3 and confirm CDISC naming convention compliance&quot;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;Derived origin with no derivation description&lt;/td&gt;
        &lt;td&gt;✅ Pass — Origin Type=Derived is valid&lt;/td&gt;
        &lt;td&gt;❌ Query — &quot;Derivation method not documented in define.xml or referenced specifications&quot;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;QVAL values inconsistent with parent domain variables (AESLIFE=Y where AESER=N)&lt;/td&gt;
        &lt;td&gt;✅ Pass — No cross-domain logic checks in P21 at this level&lt;/td&gt;
        &lt;td&gt;❌ Query — &quot;Logical inconsistency between SUPPAE.AESLIFE and AE.AESER for subject [ID]&quot;&lt;/td&gt;
      &lt;/tr&gt;
    &lt;/tbody&gt;
  &lt;/table&gt;

  &lt;h3&gt;13.2 Practical Cross-Domain Checks&lt;/h3&gt;

  &lt;p&gt;These are checks Pinnacle 21 does not perform but reviewers routinely validate manually. Automating them before submission eliminates a major class of queries.&lt;/p&gt;

  &lt;table&gt;
    &lt;thead&gt;
      &lt;tr&gt;
        &lt;th&gt;SUPPQUAL Signal&lt;/th&gt;
        &lt;th&gt;Parent Domain Check&lt;/th&gt;
        &lt;th&gt;Failure Pattern&lt;/th&gt;
      &lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody&gt;
      &lt;tr&gt;
        &lt;td&gt;AESLIFE = &#39;Y&#39;&lt;/td&gt;
        &lt;td&gt;AE.AESER should be &#39;Y&#39;&lt;/td&gt;
        &lt;td&gt;Life-threatening event marked non-serious&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;AECONTRT = &#39;Y&#39;&lt;/td&gt;
        &lt;td&gt;AE.AEREL = &#39;NOT RELATED&#39;&lt;/td&gt;
        &lt;td&gt;Treatment given but event marked unrelated&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;AOCCIFL = &#39;Y&#39;&lt;/td&gt;
        &lt;td&gt;Check earlier AESTDTC for same PT&lt;/td&gt;
        &lt;td&gt;Incorrect first-occurrence flag&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;SUPPLB.LBMETHOD present&lt;/td&gt;
        &lt;td&gt;LB.LBSPEC consistent&lt;/td&gt;
        &lt;td&gt;Method/specimen mismatch&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;SUPPAE.VISIT&lt;/td&gt;
        &lt;td&gt;SV / TV alignment&lt;/td&gt;
        &lt;td&gt;Visit naming inconsistencies&lt;/td&gt;
      &lt;/tr&gt;
    &lt;/tbody&gt;
  &lt;/table&gt;

  &lt;h4&gt;Example SAS Check: SUPPAE vs AE Consistency&lt;/h4&gt;

&lt;pre&gt;/* Cross-domain QC: flag AESLIFE=Y where AE.AESER ne Y */
/* Run this before define.xml finalization — P21 will not catch it */
proc sql;
  create table ae_mismatch as
  select a.usubjid,
         a.idvarval    as aeseq,
         a.qval        as aeslife,
         b.aeser
  from suppae a
  left join ae b
    on  a.usubjid = b.usubjid
    and input(a.idvarval, best.) = b.aeseq
  where a.qnam   = &#39;AESLIFE&#39;
    and a.qval   = &#39;Y&#39;
    and b.aeser ne &#39;Y&#39;;
quit;

/* If ae_mismatch has rows: data error or documentation gap.
   Resolve before submission. Document exceptions in the
   Data Reviewer&#39;s Guide if clinically justified. */&lt;/pre&gt;

  &lt;h3&gt;13.3 The Practical Rule&lt;/h3&gt;

  &lt;p&gt;Passing Pinnacle 21 means your data is structurally acceptable for submission intake. It does not mean your metadata is review-ready. The two gates are sequential, not equivalent. Run P21 to clear the first gate. Then review your define.xml as if you are an FDA data reviewer who has never seen this study and has 20 minutes to understand what SUPPAE contains. If you would have questions, so will the reviewer.&lt;/p&gt;

  &lt;!-- ============================================================ --&gt;
  &lt;hr class=&quot;section-divider&quot;&gt;
  &lt;h2 id=&quot;scale&quot;&gt;14. Scaling Problems in Large SUPPQUAL Domains&lt;/h2&gt;

  &lt;p&gt;In real studies, SUPPLB with 40-plus QNAMs and SUPPCM with 60-plus QNAMs are not unusual. At that scale, define.xml stops being a reference document and starts being a navigation problem. This is a design failure that manifests as a metadata failure, and VLM construction alone cannot solve it.&lt;/p&gt;

  &lt;h3&gt;14.1 What Goes Wrong at Scale&lt;/h3&gt;

  &lt;p&gt;A ValueListDef block with 50 ItemRef entries renders slowly in define viewers and is cognitively unnavigable. QNAM names become abbreviated to the point of opacity. VLM entries that repeat the same boilerplate description across dozens of Y/N flags become indistinguishable to a reviewer scanning for meaning. The define.xml becomes accurate but useless.&lt;/p&gt;

  &lt;p&gt;The specific failure pattern in SUPPLB is worth naming. Laboratory method, specimen type, and result units are often implemented as separate QNAMs per test code — SPECBLOOD, SPECURINE, METHCHEM, METHHEMA — rather than as a single QNAM with controlled terminology. This explodes the QNAM count for no semantic gain and makes the VLM block look like noise.&lt;/p&gt;

  &lt;h3&gt;14.2 What Experienced Teams Do&lt;/h3&gt;

  &lt;p&gt;Before building VLM for a large SUPPQUAL domain, conduct a QNAM audit. Group all QNAMs by semantic category. If multiple QNAMs represent the same concept with different scope (specimen type by test, method by test), evaluate whether a single QNAM with controlled values covers all cases. The goal is the smallest QNAM count that captures all necessary information without ambiguity.&lt;/p&gt;

  &lt;p&gt;Keep QNAM names human-readable within the 8-character constraint. SPECTYP is better than SPCTX3. METHCD is better than MTHCD2. Reviewers read QNAM values directly in the dataset — they should not need to reference define.xml to understand what category of information a QNAM represents.&lt;/p&gt;

  &lt;div class=&quot;note&quot;&gt;
    Hard truth: if your SUPPQUAL domain requires scrolling to navigate in a define viewer, it is already too complex. The define.xml is a symptom. The design is the problem.
  &lt;/div&gt;

  &lt;h3&gt;14.3 Re-Evaluating Domain Design When Scale Grows&lt;/h3&gt;

  &lt;p&gt;A SUPPCM with 60 QNAMs representing concomitant medication classification attributes is not a well-designed SUPPQUAL. It is an unstated custom domain. If the study team cannot justify why these qualifiers could not be represented as named variables in CM or a CM extension domain, that is the conversation to have before the submission package is built — not after define.xml review comes back with 30 metadata queries.&lt;/p&gt;

  &lt;div class=&quot;warn&quot;&gt;
    &lt;strong&gt;If your SUPPQUAL cannot be understood without scrolling, filtering, or cross-referencing multiple sections of define.xml, it is already too complex for efficient review.&lt;/strong&gt; At that point you are not solving a metadata problem. You are managing the consequences of a design decision that should have been made differently.
  &lt;/div&gt;

  &lt;!-- ============================================================ --&gt;
  &lt;hr class=&quot;section-divider&quot;&gt;
  &lt;h2 id=&quot;v21&quot;&gt;15. Define.xml v2.0 vs v2.1 — What Changes for SUPPQUAL&lt;/h2&gt;

  &lt;p&gt;Most teams treat v2.0 and v2.1 as interchangeable for SUPPQUAL work. They are not, and the differences cluster precisely around the VLM and WhereClause features that are central to SUPPQUAL metadata.&lt;/p&gt;

  &lt;h3&gt;15.1 WhereClause Handling&lt;/h3&gt;

  &lt;p&gt;In Define-XML v2.1, WhereClause handling is more formally specified, particularly for multi-condition expressions. The &lt;code&gt;def:WhereClauseDef&lt;/code&gt; element in v2.1 supports cleaner namespacing and has better-defined behavior for AND semantics across multiple RangeCheck elements. If you are building complex multi-condition WhereClause entries — scoping VLM by both QNAM and IDVAR simultaneously — v2.1 is more predictable in how validators and viewers interpret the expression.&lt;/p&gt;

  &lt;h3&gt;15.2 ExternalCodeList and Controlled Terminology Linkage&lt;/h3&gt;

  &lt;p&gt;v2.1 introduces cleaner linkage mechanisms for external controlled terminology dictionaries, including support for NCI Thesaurus version pinning in the &lt;code&gt;StandardOID&lt;/code&gt; attribute chain. For SUPPQUAL QNAMs that reference MedDRA, SNOMED, or LOINC values — which appear in specialized domains like SUPPDS or SUPPFA — v2.1 provides more precise and validator-checkable external codelist references.&lt;/p&gt;

  &lt;h3&gt;15.3 Reviewer Tooling Compatibility&lt;/h3&gt;

  &lt;p&gt;Modern FDA review tooling (JMP Clinical, the Agency&#39;s internal viewers, and the CDISC Define-XML viewer) handle v2.1 correctly. Legacy sponsor define viewers may not render v2.1 features correctly, particularly the improved WhereClause rendering. Validate your output in both the CDISC Define-XML viewer and Pinnacle 21 regardless of which version you target. If your submission standards allow v2.1, use it — but confirm with your regulatory affairs team that the target agency accepts v2.1 for the specific submission type.&lt;/p&gt;

  &lt;table&gt;
    &lt;thead&gt;
      &lt;tr&gt;&lt;th&gt;Feature&lt;/th&gt;&lt;th&gt;v2.0 Behavior&lt;/th&gt;&lt;th&gt;v2.1 Behavior&lt;/th&gt;&lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody&gt;
      &lt;tr&gt;&lt;td&gt;Multi-condition WhereClause&lt;/td&gt;&lt;td&gt;Supported but ambiguously specified&lt;/td&gt;&lt;td&gt;Formally specified AND semantics&lt;/td&gt;&lt;/tr&gt;
      &lt;tr&gt;&lt;td&gt;ExternalCodeList&lt;/td&gt;&lt;td&gt;Basic dictionary reference&lt;/td&gt;&lt;td&gt;Version-pinned, StandardOID-linked&lt;/td&gt;&lt;/tr&gt;
      &lt;tr&gt;&lt;td&gt;ValueListDef scoping&lt;/td&gt;&lt;td&gt;Functional but verbose&lt;/td&gt;&lt;td&gt;Same structure, better validator support&lt;/td&gt;&lt;/tr&gt;
      &lt;tr&gt;&lt;td&gt;FDA tooling acceptance&lt;/td&gt;&lt;td&gt;Fully accepted&lt;/td&gt;&lt;td&gt;Accepted; preferred for new submissions&lt;/td&gt;&lt;/tr&gt;
      &lt;tr&gt;&lt;td&gt;PMDA tooling acceptance&lt;/td&gt;&lt;td&gt;Fully accepted&lt;/td&gt;&lt;td&gt;Accepted; confirm per submission type&lt;/td&gt;&lt;/tr&gt;
    &lt;/tbody&gt;
  &lt;/table&gt;

  &lt;!-- ============================================================ --&gt;
  &lt;hr class=&quot;section-divider&quot;&gt;
  &lt;h2 id=&quot;edge-cases&quot;&gt;16. Edge Cases You Will Hit&lt;/h2&gt;

  &lt;p&gt;These are not hypotheticals. Every experienced SDTM programmer encounters all of them eventually. Knowing the right define.xml response in advance saves a revision cycle.&lt;/p&gt;

  &lt;h3&gt;Case 1: Same QNAM Name, Different Meaning Across Studies&lt;/h3&gt;

  &lt;p&gt;Not a problem within one dataset — it is a problem when you reuse a define.xml template across studies without auditing QNAM semantics. QNAM=FAST might mean &quot;fasting status confirmed&quot; in one study and &quot;hours of fasting prior to sample&quot; in another. The former is Y/N with a NY codelist. The latter is numeric stored as text with no codelist. If you port the VLM entry without reviewing the clinical meaning in the new study context, you will have a define.xml that says the wrong thing about your data.&lt;/p&gt;

  &lt;h3&gt;Case 2: Numeric QVAL Stored as Text&lt;/h3&gt;

  &lt;p&gt;This is extremely common. QVAL is always character in the XPT — SAS character variable, always. But some QNAMs store what is functionally a numeric value: a score, a count, a duration in hours. In define.xml, the DataType for these VLM entries should still be &lt;code&gt;text&lt;/code&gt; to match the XPT variable type, but the Description must explicitly state that the content is numeric and document the units and expected range. Without that note, a reviewer seeing QVAL=&quot;72&quot; has no frame of reference.&lt;/p&gt;

&lt;pre&gt;&amp;lt;ItemDef OID=&quot;IT.SUPPAE.QVAL.AEDURH&quot;
         Name=&quot;AEDURH&quot;
         DataType=&quot;text&quot;
         Length=&quot;5&quot;
         SASFieldName=&quot;QVAL&quot;&amp;gt;
  &amp;lt;Description&amp;gt;
    &amp;lt;TranslatedText xml:lang=&quot;en&quot;&amp;gt;
      Duration of adverse event in hours. Numeric value stored as character.
      Derived from (AEENDTC - AESTDTC) in hours, rounded to nearest integer.
      Range: 0 to 8760 (one year). Units: hours.
    &amp;lt;/TranslatedText&amp;gt;
  &amp;lt;/Description&amp;gt;
  &amp;lt;def:Origin Type=&quot;Derived&quot;/&amp;gt;
&amp;lt;/ItemDef&amp;gt;&lt;/pre&gt;

  &lt;h3&gt;Case 3: Blank vs Missing QVAL&lt;/h3&gt;

  &lt;p&gt;A blank QVAL and a missing QVAL mean different things clinically. Blank can mean the question was asked and the answer was empty or not applicable. Missing in SDTM terms typically means the record should not exist. In SUPPQUAL, a row with a blank QVAL should almost never exist — if there is nothing to report for a qualifier, the row should not be created. If your dataset contains rows with blank QVAL, your VLM Description should explicitly state whether blank is a permissible value and what it signifies. Otherwise reviewers will query every blank QVAL as a potential data quality issue.&lt;/p&gt;

  &lt;h3&gt;Case 4: Multiple QNAMs Representing One Concept&lt;/h3&gt;

  &lt;p&gt;This arises from CRF mapping where multiple checkboxes each become their own QNAM. A concomitant medication reason-for-use form with 10 checkboxes should not produce 10 QNAMs (REASCARD, REASDIAB, REASHYP…). It should produce one QNAM (REAS) with controlled terminology values, or at most two QNAMs if the reason structure is genuinely multi-level. Each additional QNAM multiplies your VLM burden and dilutes the reviewer&#39;s ability to understand the data conceptually. Consolidation before SDTM mapping is far easier than VLM cleanup after the fact.&lt;/p&gt;

  &lt;!-- ============================================================ --&gt;
  &lt;hr class=&quot;section-divider&quot;&gt;
  &lt;h2 id=&quot;reviewer-model&quot;&gt;17. How Reviewers Actually Read SUPPQUAL&lt;/h2&gt;

  &lt;p&gt;Reviewers do not read SUPPQUAL sequentially. They follow a chain: QNAM → WhereClause → VLM entry → Origin → parent domain. If any step in that chain is unclear, the result is a query.&lt;/p&gt;

  &lt;p&gt;Understanding that path is the single most useful frame for designing SUPPQUAL define.xml well. Design for it and the metadata almost writes itself.&lt;/p&gt;

  &lt;h3&gt;17.1 The Reviewer Flow&lt;/h3&gt;

  &lt;p&gt;A reviewer opens the SUPPQUAL dataset. They see a QNAM value — say, AESLIFE. They have one of two reactions: they know what it means immediately, or they go to define.xml. If they go to define.xml, they look at the VLM entry for AESLIFE in sequence: the Description, the Origin, the CodelistRef if present, and the PDFPageRef if the origin is CRF. They reconstruct the meaning of the variable from those four elements.&lt;/p&gt;

  &lt;p&gt;If any element is missing or vague, they stop reconstructing and start writing a query. The threshold is roughly five to ten seconds. If the meaning is not clear in that window, it becomes a formal question.&lt;/p&gt;

  &lt;h3&gt;17.2 What This Means for Your Metadata&lt;/h3&gt;

  &lt;p&gt;Description first. It is the primary artifact. Everything else is supporting documentation. A Description that fully explains the variable — its clinical meaning, its derivation if Derived, its permissible values if not codelist-controlled — means the reviewer may never need to open the CRF or the specs document. That is the standard to target.&lt;/p&gt;

  &lt;p&gt;Origin second. A clear, specific origin with a resolvable document reference tells the reviewer they could verify the source if they chose to. The fact that they could verify it is often enough that they do not need to.&lt;/p&gt;

  &lt;p&gt;Codelist third. If a codelist is present, the reviewer expects the data to conform to it exactly. Do not reference a codelist for a variable with values that are not in the codelist. That is worse than having no codelist reference.&lt;/p&gt;

  &lt;div class=&quot;note&quot;&gt;
    Design for the reviewer who is reading your SUPPQUAL at 4 PM on a Friday after reviewing six other domains. They are not going to ask clarifying questions mentally. They are going to write queries. Every piece of define.xml that removes a potential query is time saved on both sides of the submission.
  &lt;/div&gt;

  &lt;!-- ============================================================ --&gt;
  &lt;hr class=&quot;section-divider&quot;&gt;
  &lt;h2 id=&quot;cross-domain&quot;&gt;18. Cross-Domain Consistency — The Silent Check&lt;/h2&gt;

  &lt;p&gt;Reviewers do not look at SUPPQUAL in isolation. They compare it against the parent domain systematically, and they compare SUPPQUAL-derived flags against analysis datasets. Logical inconsistencies between SUPPQUAL and parent domain variables are one of the most common sources of late-stage submission queries — because they require investigation to determine whether the inconsistency reflects a data error, a derivation error, or a documentation error.&lt;/p&gt;

  &lt;h3&gt;18.1 Practical Cross-Domain Checks&lt;/h3&gt;

  &lt;table&gt;
    &lt;thead&gt;
      &lt;tr&gt;
        &lt;th&gt;SUPPQUAL Signal&lt;/th&gt;
        &lt;th&gt;Parent Domain Check&lt;/th&gt;
        &lt;th&gt;Failure Pattern&lt;/th&gt;
      &lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody&gt;
      &lt;tr&gt;
        &lt;td&gt;AESLIFE = &#39;Y&#39;&lt;/td&gt;
        &lt;td&gt;AE.AESER should be &#39;Y&#39;&lt;/td&gt;
        &lt;td&gt;Life-threatening event marked non-serious&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;AECONTRT = &#39;Y&#39;&lt;/td&gt;
        &lt;td&gt;AE.AEREL = &#39;NOT RELATED&#39;&lt;/td&gt;
        &lt;td&gt;Treatment given but event marked unrelated&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;AOCCIFL = &#39;Y&#39;&lt;/td&gt;
        &lt;td&gt;Check earlier AESTDTC for same PT&lt;/td&gt;
        &lt;td&gt;Incorrect first-occurrence flag&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;SUPPLB.LBMETHOD present&lt;/td&gt;
        &lt;td&gt;LB.LBSPEC consistent&lt;/td&gt;
        &lt;td&gt;Method-specimen mismatch&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;VISIT in SUPP&lt;/td&gt;
        &lt;td&gt;SV / TV alignment&lt;/td&gt;
        &lt;td&gt;Visit naming inconsistencies&lt;/td&gt;
      &lt;/tr&gt;
    &lt;/tbody&gt;
  &lt;/table&gt;

  &lt;h3&gt;18.2 Reviewer Query Patterns&lt;/h3&gt;

  &lt;table&gt;
    &lt;thead&gt;
      &lt;tr&gt;&lt;th&gt;SUPPQUAL QNAM / Value&lt;/th&gt;&lt;th&gt;Parent Domain Variable / Expectation&lt;/th&gt;&lt;th&gt;Query Pattern&lt;/th&gt;&lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody&gt;
      &lt;tr&gt;
        &lt;td&gt;SUPPAE.AESLIFE = Y&lt;/td&gt;
        &lt;td&gt;AE.AESER should = Y (life-threatening implies serious)&lt;/td&gt;
        &lt;td&gt;&quot;Subject [ID] has AESLIFE=Y in SUPPAE but AESER=N in AE. Please reconcile.&quot;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;SUPPAE.AECONTRT = Y&lt;/td&gt;
        &lt;td&gt;A CM record for the AE treatment period should exist&lt;/td&gt;
        &lt;td&gt;&quot;Concomitant treatment flagged in SUPPAE but no corresponding CM record found for subject [ID].&quot;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;SUPPLB.FAST = Y&lt;/td&gt;
        &lt;td&gt;LB.LBTPT or LB.LBTPTNUM should reflect fasting timepoint&lt;/td&gt;
        &lt;td&gt;&quot;Fasting flag present in SUPPLB but LBTPT does not indicate fasting condition.&quot;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;SUPPDS.DSSREAS (discontinuation reason)&lt;/td&gt;
        &lt;td&gt;DS.DSDECOD controlled term should align&lt;/td&gt;
        &lt;td&gt;&quot;SUPPDS freetext reason inconsistent with DS.DSDECOD for subject [ID].&quot;&lt;/td&gt;
      &lt;/tr&gt;
    &lt;/tbody&gt;
  &lt;/table&gt;

  &lt;h3&gt;18.3 What Define.xml Can and Cannot Do Here&lt;/h3&gt;

  &lt;p&gt;Define.xml cannot prevent cross-domain logical inconsistencies — those are data quality issues. But define.xml can make the reviewer&#39;s investigation faster and less adversarial. If your SUPPAE VLM entry for AESLIFE includes a Description note stating &quot;Expected to be consistent with AE.AESER; exceptions documented in the data reviewer&#39;s guide,&quot; you have pre-answered the question. The reviewer knows you thought about the relationship. That changes the tone of the review interaction significantly.&lt;/p&gt;

  &lt;p&gt;Add cross-domain consistency notes to VLM Descriptions for any qualifier that has a logical dependency on a parent domain variable. It adds two sentences to your metadata and removes a potential two-week query-response cycle.&lt;/p&gt;

  &lt;!-- ============================================================ --&gt;
  &lt;hr class=&quot;section-divider&quot;&gt;
  &lt;h2 id=&quot;automation&quot;&gt;19. Levels of Automation — Maturity Model&lt;/h2&gt;

  &lt;p&gt;The SAS generation utility in Section 9 represents one point on a spectrum. Where a team sits on this spectrum determines how many define.xml errors they make per submission, how long define QC takes, and how much rework they absorb when datasets change late in the submission timeline.&lt;/p&gt;

  &lt;table&gt;
    &lt;thead&gt;
      &lt;tr&gt;&lt;th&gt;Level&lt;/th&gt;&lt;th&gt;Description&lt;/th&gt;&lt;th&gt;Error Rate&lt;/th&gt;&lt;th&gt;Rework Cost When Data Changes&lt;/th&gt;&lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody&gt;
      &lt;tr&gt;
        &lt;td&gt;&lt;strong&gt;1 — Manual XML editing&lt;/strong&gt;&lt;/td&gt;
        &lt;td&gt;Define.xml authored or edited by hand in a text editor or define tool UI&lt;/td&gt;
        &lt;td&gt;High — typos, OID mismatches, missed QNAMs&lt;/td&gt;
        &lt;td&gt;Very high — every QNAM change requires manual XML edits&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;&lt;strong&gt;2 — Metadata-driven generation&lt;/strong&gt;&lt;/td&gt;
        &lt;td&gt;Define.xml generated from a metadata specs dataset; programmers edit the specs, not the XML&lt;/td&gt;
        &lt;td&gt;Medium — errors in specs propagate consistently; easier to catch and fix&lt;/td&gt;
        &lt;td&gt;Medium — update specs dataset, regenerate&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;&lt;strong&gt;3 — Dataset-derived VLM auto-build&lt;/strong&gt;&lt;/td&gt;
        &lt;td&gt;VLM entries generated directly from the production SUPPQUAL XPT; QNAM list and lengths derived programmatically&lt;/td&gt;
        &lt;td&gt;Low — structural metadata always reflects actual data&lt;/td&gt;
        &lt;td&gt;Low — rerun the generation program against updated XPT&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;&lt;strong&gt;4 — Full validation and reconciliation framework&lt;/strong&gt;&lt;/td&gt;
        &lt;td&gt;Automated comparison of define.xml against datasets plus cross-domain consistency checks; discrepancies flagged in a QC report&lt;/td&gt;
        &lt;td&gt;Very low — errors caught before submission regardless of late data changes&lt;/td&gt;
        &lt;td&gt;Very low — reconciliation runs catch drift automatically&lt;/td&gt;
      &lt;/tr&gt;
    &lt;/tbody&gt;
  &lt;/table&gt;

  &lt;p&gt;Level 3 is the practical target for any team running more than two to three submissions per year. Level 4 is achievable with investment in the reconciliation tooling and is worth building if your team operates an FSP model across multiple sponsors. The metadata drift problem — where define.xml and datasets diverge after a late protocol amendment — is the most common cause of last-minute submission delays, and it is entirely preventable with Level 3 or 4 automation.&lt;/p&gt;

  &lt;!-- ============================================================ --&gt;
  &lt;hr class=&quot;section-divider&quot;&gt;
  &lt;h2 id=&quot;bad-good&quot;&gt;20. Bad vs Good — Full Picture&lt;/h2&gt;

  &lt;p&gt;Everything in the preceding sections collapses into this single comparison. This is the difference between a define.xml that passes intake and one that survives review.&lt;/p&gt;

  &lt;h3&gt;❌ The Incomplete Package&lt;/h3&gt;

&lt;pre&gt;&amp;lt;!-- Column-level QVAL — no ValueListRef --&amp;gt;
&amp;lt;ItemDef OID=&quot;IT.SUPPAE.QVAL&quot;
         Name=&quot;QVAL&quot;
         DataType=&quot;text&quot;
         Length=&quot;200&quot;
         SASFieldName=&quot;QVAL&quot;&amp;gt;
  &amp;lt;Description&amp;gt;
    &amp;lt;TranslatedText xml:lang=&quot;en&quot;&amp;gt;Result Value&amp;lt;/TranslatedText&amp;gt;
  &amp;lt;/Description&amp;gt;
  &amp;lt;!-- Missing: def:ValueListRef --&amp;gt;
  &amp;lt;def:Origin Type=&quot;Predecessor&quot;/&amp;gt;
&amp;lt;/ItemDef&amp;gt;

&amp;lt;!-- No ValueListDef block --&amp;gt;
&amp;lt;!-- No WhereClauseDef entries --&amp;gt;
&amp;lt;!-- No VLM-level ItemDefs --&amp;gt;&lt;/pre&gt;

  &lt;p&gt;What a reviewer sees: QVAL with no VLM. Every QNAM in the dataset is undocumented at the value level. The reviewer must interpret AESLIFE, AECONTRT, AOCCIFL, and every other qualifier without metadata support. Origin=Predecessor on a structural column with no predecessor documentation. Queries incoming.&lt;/p&gt;

  &lt;h3&gt;✅ The Complete Package&lt;/h3&gt;

&lt;pre&gt;&amp;lt;!-- Column-level QVAL with ValueListRef --&amp;gt;
&amp;lt;ItemDef OID=&quot;IT.SUPPAE.QVAL&quot;
         Name=&quot;QVAL&quot;
         DataType=&quot;text&quot;
         Length=&quot;200&quot;
         SASFieldName=&quot;QVAL&quot;&amp;gt;
  &amp;lt;Description&amp;gt;
    &amp;lt;TranslatedText xml:lang=&quot;en&quot;&amp;gt;
      Result value for the supplemental qualifier identified by QNAM.
      See value-level metadata for QNAM-specific types, codelists, and origins.
    &amp;lt;/TranslatedText&amp;gt;
  &amp;lt;/Description&amp;gt;
  &amp;lt;def:ValueListRef ValueListOID=&quot;VL.SUPPAE.QVAL&quot;/&amp;gt;
&amp;lt;/ItemDef&amp;gt;

&amp;lt;!-- ValueListDef: one entry per QNAM --&amp;gt;
&amp;lt;def:ValueListDef OID=&quot;VL.SUPPAE.QVAL&quot;&amp;gt;
  &amp;lt;ItemRef ItemOID=&quot;IT.SUPPAE.QVAL.AESLIFE&quot;
           OrderNumber=&quot;1&quot; Mandatory=&quot;Yes&quot;&amp;gt;
    &amp;lt;def:WhereClauseRef WhereClauseOID=&quot;WC.SUPPAE.QNAM.AESLIFE&quot;/&amp;gt;
  &amp;lt;/ItemRef&amp;gt;
  &amp;lt;ItemRef ItemOID=&quot;IT.SUPPAE.QVAL.AOCCIFL&quot;
           OrderNumber=&quot;2&quot; Mandatory=&quot;No&quot;&amp;gt;
    &amp;lt;def:WhereClauseRef WhereClauseOID=&quot;WC.SUPPAE.QNAM.AOCCIFL&quot;/&amp;gt;
  &amp;lt;/ItemRef&amp;gt;
&amp;lt;/def:ValueListDef&amp;gt;

&amp;lt;!-- WhereClauseDefs --&amp;gt;
&amp;lt;def:WhereClauseDef OID=&quot;WC.SUPPAE.QNAM.AESLIFE&quot;&amp;gt;
  &amp;lt;RangeCheck Comparator=&quot;EQ&quot; SoftHard=&quot;Soft&quot;
              def:ItemOID=&quot;IT.SUPPAE.QNAM&quot;&amp;gt;
    &amp;lt;CheckValue&amp;gt;AESLIFE&amp;lt;/CheckValue&amp;gt;
  &amp;lt;/RangeCheck&amp;gt;
&amp;lt;/def:WhereClauseDef&amp;gt;

&amp;lt;def:WhereClauseDef OID=&quot;WC.SUPPAE.QNAM.AOCCIFL&quot;&amp;gt;
  &amp;lt;RangeCheck Comparator=&quot;EQ&quot; SoftHard=&quot;Soft&quot;
              def:ItemOID=&quot;IT.SUPPAE.QNAM&quot;&amp;gt;
    &amp;lt;CheckValue&amp;gt;AOCCIFL&amp;lt;/CheckValue&amp;gt;
  &amp;lt;/RangeCheck&amp;gt;
&amp;lt;/def:WhereClauseDef&amp;gt;

&amp;lt;!-- VLM-level ItemDefs: AESLIFE --&amp;gt;
&amp;lt;ItemDef OID=&quot;IT.SUPPAE.QVAL.AESLIFE&quot;
         Name=&quot;AESLIFE&quot;
         DataType=&quot;text&quot;
         Length=&quot;1&quot;
         SASFieldName=&quot;QVAL&quot;&amp;gt;
  &amp;lt;Description&amp;gt;
    &amp;lt;TranslatedText xml:lang=&quot;en&quot;&amp;gt;
      Indicator of whether the adverse event was life-threatening at the
      time of occurrence. Expected to be consistent with AE.AESER=Y;
      exceptions documented in the Data Reviewer&#39;s Guide Section 4.
    &amp;lt;/TranslatedText&amp;gt;
  &amp;lt;/Description&amp;gt;
  &amp;lt;CodeListRef CodeListOID=&quot;CL.NY&quot;/&amp;gt;
  &amp;lt;def:Origin Type=&quot;CRF&quot;&amp;gt;
    &amp;lt;def:DocumentRef leafID=&quot;LF.ACRF&quot;&amp;gt;
      &amp;lt;def:PDFPageRef Type=&quot;NamedDestination&quot; PageRefs=&quot;AE_SAE_PAGE&quot;/&amp;gt;
    &amp;lt;/def:DocumentRef&amp;gt;
  &amp;lt;/def:Origin&amp;gt;
&amp;lt;/ItemDef&amp;gt;

&amp;lt;!-- VLM-level ItemDefs: AOCCIFL --&amp;gt;
&amp;lt;ItemDef OID=&quot;IT.SUPPAE.QVAL.AOCCIFL&quot;
         Name=&quot;AOCCIFL&quot;
         DataType=&quot;text&quot;
         Length=&quot;1&quot;
         SASFieldName=&quot;QVAL&quot;&amp;gt;
  &amp;lt;Description&amp;gt;
    &amp;lt;TranslatedText xml:lang=&quot;en&quot;&amp;gt;
      Flag indicating the first chronological occurrence of an AE preferred
      term within a subject. Set to &#39;Y&#39; for the first record by AESTDTC
      ascending, then AESEQ ascending. Subsequent occurrences left blank.
      Derived — no CRF source.
    &amp;lt;/TranslatedText&amp;gt;
  &amp;lt;/Description&amp;gt;
  &amp;lt;CodeListRef CodeListOID=&quot;CL.NY&quot;/&amp;gt;
  &amp;lt;def:Origin Type=&quot;Derived&quot;&amp;gt;
    &amp;lt;def:DocumentRef leafID=&quot;LF.SDTM_SPECS&quot;&amp;gt;
      &amp;lt;def:PDFPageRef Type=&quot;NamedDestination&quot;
                      PageRefs=&quot;SUPPAE_AOCCIFL_ALGO&quot;/&amp;gt;
    &amp;lt;/def:DocumentRef&amp;gt;
  &amp;lt;/def:Origin&amp;gt;
&amp;lt;/ItemDef&amp;gt;&lt;/pre&gt;

  &lt;p&gt;The difference is not volume. The complete version has more XML, but every element is doing specific work. The incomplete version has almost no XML and none of it is useful. A reviewer reading the complete version can understand AESLIFE and AOCCIFL in under ten seconds each. A reviewer reading the incomplete version has to write queries to get that same information. That is the operational definition of a well-built SUPPQUAL define package versus a broken one.&lt;/p&gt;

  &lt;!-- ============================================================ --&gt;
  &lt;hr class=&quot;section-divider&quot;&gt;
  &lt;h2&gt;Quick Reference Mental Model&lt;/h2&gt;

  &lt;p&gt;Every SUPPQUAL VLM entry follows this chain. If any link is broken or missing, the reviewer cannot interpret the variable.&lt;/p&gt;

  &lt;pre style=&quot;background:#1a1a2e; color:#e8e8f0; border-left:4px solid #4a90d9; font-size:13px; line-height:2;&quot;&gt;
  QNAM value in dataset
       │
       ▼
  WhereClauseDef  ──── scopes the VLM entry to this QNAM
       │
       ▼
  VLM ItemDef (IT.SUPPXX.QVAL.QNAMNAME)
       ├── Description    ◄── clinical meaning + derivation note
       ├── DataType/Length◄── actual data characteristics
       ├── CodeListRef    ◄── controlled values (if applicable)
       └── Origin         ◄── CRF page / Derived method / Assigned source
  &lt;/pre&gt;

  &lt;p&gt;Build this chain for every QNAM in every SUPPQUAL domain, make each link explicit, and the define.xml review writes itself.&lt;/p&gt;

  &lt;!-- ============================================================ --&gt;
  &lt;hr class=&quot;section-divider&quot;&gt;
  &lt;h2&gt;Final Thought&lt;/h2&gt;

  &lt;p&gt;SUPPQUAL define.xml complexity is a direct reflection of the architectural tradeoff CDISC made when they designed the supplemental qualifier model. A fixed eight-column structure that can carry arbitrary qualifiers is elegant for dataset design. For metadata, it is a nightmare — because the metadata burden that you would normally spread across named variables all collapses into one column (QVAL) and one discriminator (QNAM), and define.xml has to reconstruct that variable-level meaning through VLM machinery that most programmers interact with infrequently enough to make mistakes every time.&lt;/p&gt;

  &lt;p&gt;The way to get good at this is to build it wrong once, get the rejection letter, understand exactly which element was missing or malformed, and then build the QC scaffolding that prevents that specific error from ever occurring again. The checklist above is the aggregate of those failures across multiple programs. It saves you from learning each one the hard way.&lt;/p&gt;

  &lt;p&gt;&lt;strong&gt;SUPPQUAL is not documentation of data. It is reconstruction of variables. If define.xml does not make those variables obvious, the reviewer will stop and ask you to explain them.&lt;/strong&gt;&lt;/p&gt;

  &lt;div class=&quot;tags&quot;&gt;
    Tags:
    &lt;span&gt;define.xml&lt;/span&gt;
    &lt;span&gt;SUPPQUAL&lt;/span&gt;
    &lt;span&gt;value-level metadata&lt;/span&gt;
    &lt;span&gt;VLM&lt;/span&gt;
    &lt;span&gt;SDTM&lt;/span&gt;
    &lt;span&gt;FDA submission&lt;/span&gt;
    &lt;span&gt;PMDA&lt;/span&gt;
    &lt;span&gt;WhereClauseDef&lt;/span&gt;
    &lt;span&gt;QNAM&lt;/span&gt;
    &lt;span&gt;IDVAR&lt;/span&gt;
    &lt;span&gt;reviewer alignment&lt;/span&gt;
    &lt;span&gt;Pinnacle 21&lt;/span&gt;
    &lt;span&gt;automation&lt;/span&gt;
    &lt;span&gt;cross-domain&lt;/span&gt;
    &lt;span&gt;regulatory&lt;/span&gt;
    &lt;span&gt;SAS&lt;/span&gt;
  &lt;/div&gt;

&lt;/div&gt;
&lt;/body&gt;
&lt;/html&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2315822260943695633/posts/default/7149458382487958212'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2315822260943695633/posts/default/7149458382487958212'/><link rel='alternate' type='text/html' href='http://studysas.blogspot.com/2026/04/definexml-for-suppqual-getting-qnam.html' title='Define.xml for SUPPQUAL — Getting QNAM-Level Metadata Right'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-2315822260943695633.post-8138307078881198116</id><published>2026-04-03T13:15:00.007-04:00</published><updated>2026-04-03T13:22:44.840-04:00</updated><title type='text'>Dynamic SUPPQUAL Generation Using Metadata-Driven SAS Macros</title><content type='html'>&lt;!DOCTYPE html&gt;
&lt;html lang=&quot;en&quot;&gt;
&lt;head&gt;
&lt;meta charset=&quot;UTF-8&quot;&gt;
&lt;title&gt;&lt;/title&gt;
&lt;meta name=&quot;description&quot; content=&quot;Automate SUPPQUAL generation across AE, CM, LB, DM and other parent domains from one control dataset. Metadata-driven SAS macros with validation, IDVAR type detection, and submission-ready sort order.&quot;&gt;
&lt;meta name=&quot;keywords&quot; content=&quot;SDTM, SUPPQUAL, SAS macros, metadata-driven programming, define.xml, Pinnacle 21, SDTMIG, regulatory submissions, CDISC&quot;&gt;
&lt;style&gt;
  body {
    font-family: Georgia, serif;
    font-size: 16px;
    line-height: 1.75;
    color: #2b2b2b;
    max-width: 1200px;
    margin: 40px auto;
    padding: 0 24px;
    background: #ffffff;
  }
  h1 {
    font-size: 30px;
    color: #1a1a2e;
    line-height: 1.3;
    margin-bottom: 8px;
  }
  h2 {
    font-size: 21px;
    color: #1a1a2e;
    margin-top: 44px;
    border-bottom: 2px solid #e0e0e0;
    padding-bottom: 6px;
  }
  h3 {
    font-size: 17px;
    color: #333333;
    margin-top: 32px;
  }
  p {
    margin: 14px 0;
  }
  pre {
    background: #f5f5f5;
    border-left: 4px solid #0066cc;
    padding: 16px 20px;
    overflow-x: auto;
    font-size: 13.5px;
    line-height: 1.55;
    border-radius: 0 4px 4px 0;
    font-family: &quot;Courier New&quot;, monospace;
    white-space: pre-wrap;
    word-wrap: break-word;
  }
  code {
    font-family: &quot;Courier New&quot;, monospace;
    background: #f0f0f0;
    padding: 1px 5px;
    border-radius: 3px;
    font-size: 14px;
    color: #b5432a;
  }
  .note {
    background: #fff8e1;
    border-left: 4px solid #f9a825;
    padding: 12px 16px;
    margin: 20px 0;
    border-radius: 0 4px 4px 0;
    font-size: 15px;
  }
  .warn {
    background: #fff3f3;
    border-left: 4px solid #cc2222;
    padding: 12px 16px;
    margin: 20px 0;
    border-radius: 0 4px 4px 0;
    font-size: 15px;
  }
  table {
    border-collapse: collapse;
    width: 100%;
    margin: 20px 0;
    font-size: 14px;
  }
  th {
    background: #1a1a2e;
    color: #ffffff;
    text-align: left;
    padding: 8px 12px;
  }
  td {
    padding: 7px 12px;
    border-bottom: 1px solid #dddddd;
    vertical-align: top;
  }
  tr:nth-child(even) td {
    background: #f9f9f9;
  }
  .meta {
    color: #666666;
    font-size: 14px;
    margin-bottom: 28px;
  }
  .tags {
    font-size: 13px;
    color: #777777;
  }
&lt;/style&gt;
&lt;/head&gt;
&lt;body&gt;


&lt;p&gt;
If you have managed SDTM deliverables across multiple studies at the same time, you already know what happens to SUPPQUAL programs. You start with one clean macro per domain. Then a protocol amendment adds three new supplemental variables to AE. A PMDA query forces a second QNAM into EX. The DMC needs something non-standard from LB. Six months later you have domain-specific programs with no shared logic, no central control, and every change requires manual updates across multiple files.
&lt;/p&gt;

&lt;p&gt;
The metadata-driven approach fixes that. One control dataset. One macro family. All SUPP domains generated in a single call. This post walks through the full design, control dataset structure, macro architecture, the edge cases that cause real production issues, and the validation checks you should run before submission.
&lt;/p&gt;

&lt;h2&gt;The Problem with Hard-Coded SUPPQUAL Programs&lt;/h2&gt;

&lt;p&gt;
Hard-coded programs break in predictable ways. A QNAM gets added mid-study, someone has to find the right program, understand its structure, add the variable, and QC the whole thing again. An amendment renames a CRF field, IDVARVAL logic breaks silently because the source variable changed. A programmer unfamiliar with the original design adds a duplicate QNAM to a different SUPP domain by mistake. None of these are dramatic failures. They are the slow buildup of technical debt across a submission package.
&lt;/p&gt;

&lt;p&gt;
The deeper issue is that SUPPQUAL generation is structurally the same across parent domains. For every QNAM, you are doing the same thing, pulling a variable from a source dataset, converting it to character, attaching the right RDOMAIN, linking it back to the parent record through IDVAR and IDVARVAL, and writing it in long form with the correct QNAM, QLABEL, QORIG, and QEVAL assignments. The logic does not change. Only the metadata does.
&lt;/p&gt;

&lt;h2&gt;SUPPQUAL Structure, The Part That Trips People&lt;/h2&gt;

&lt;p&gt;
Most experienced programmers know the basic structure. But a few details still cause submission defects and are worth stating clearly before getting into the macro design.
&lt;/p&gt;

&lt;p&gt;
&lt;strong&gt;IDVARVAL is always character.&lt;/strong&gt; Sequence variables in parent domains such as AESEQ, CMSEQ, LBSEQ, and EXSEQ are numeric in the parent dataset. IDVARVAL in the SUPP dataset must be the character version of that value. This conversion is straightforward, &lt;code&gt;strip(put(AESEQ, best32.))&lt;/code&gt; gives you &quot;1&quot;, &quot;2&quot;, &quot;3&quot; (the &lt;code&gt;strip&lt;/code&gt; removes the leading blanks that the right-aligned &lt;code&gt;best32.&lt;/code&gt; format introduces), but it must be handled explicitly. If the source dataset uses a custom format on the sequence variable, you want the raw numeric-to-character conversion, not the formatted value. The macro should detect IDVAR type at runtime and apply &lt;code&gt;best32.&lt;/code&gt; for numeric IDVARs.
&lt;/p&gt;

&lt;p&gt;
&lt;strong&gt;IDVAR and IDVARVAL are blank for one-record-per-subject domains.&lt;/strong&gt; The DM domain has one record per subject. SUPPDM therefore has no meaningful IDVAR, so both IDVAR and IDVARVAL are left blank. Per SDTMIG Section 8.4, this is valid when there is only one record per subject in the related domain.
&lt;/p&gt;

&lt;p&gt;
&lt;strong&gt;Records with missing QVAL must be excluded.&lt;/strong&gt; Supplemental qualifier records with no data value should not be carried into the final SUPP dataset. Missing QVAL is a common validation issue and should be filtered out during generation.
&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Variable&lt;/th&gt;
      &lt;th&gt;Type&lt;/th&gt;
      &lt;th&gt;Length&lt;/th&gt;
      &lt;th&gt;Key Notes&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;&lt;td&gt;STUDYID&lt;/td&gt;&lt;td&gt;Char&lt;/td&gt;&lt;td&gt;&amp;le;200&lt;/td&gt;&lt;td&gt;Copied from parent domain&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;RDOMAIN&lt;/td&gt;&lt;td&gt;Char&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;td&gt;Parent domain abbreviation&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;USUBJID&lt;/td&gt;&lt;td&gt;Char&lt;/td&gt;&lt;td&gt;&amp;le;200&lt;/td&gt;&lt;td&gt;Copied from parent domain&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;IDVAR&lt;/td&gt;&lt;td&gt;Char&lt;/td&gt;&lt;td&gt;8&lt;/td&gt;&lt;td&gt;Sequence variable name, blank for one-record-per-subject domains&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;IDVARVAL&lt;/td&gt;&lt;td&gt;Char&lt;/td&gt;&lt;td&gt;&amp;le;200&lt;/td&gt;&lt;td&gt;Always character, blank when IDVAR is blank&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;QNAM&lt;/td&gt;&lt;td&gt;Char&lt;/td&gt;&lt;td&gt;8&lt;/td&gt;&lt;td&gt;Must start with a letter, uppercase, alphanumeric, 8 characters or less&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;QLABEL&lt;/td&gt;&lt;td&gt;Char&lt;/td&gt;&lt;td&gt;&amp;le;40&lt;/td&gt;&lt;td&gt;Descriptive label for QNAM&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;QVAL&lt;/td&gt;&lt;td&gt;Char&lt;/td&gt;&lt;td&gt;&amp;le;200&lt;/td&gt;&lt;td&gt;Always character, records with missing values excluded&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;QORIG&lt;/td&gt;&lt;td&gt;Char&lt;/td&gt;&lt;td&gt;&amp;le;200&lt;/td&gt;&lt;td&gt;CRF, Derived, Assigned, and related values per study standard&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;QEVAL&lt;/td&gt;&lt;td&gt;Char&lt;/td&gt;&lt;td&gt;&amp;le;200&lt;/td&gt;&lt;td&gt;Usually blank unless an evaluator is required, common values such as INVESTIGATOR exceed 8 characters&lt;/td&gt;&lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;div class=&quot;warn&quot;&gt;
&lt;strong&gt;QNAM naming:&lt;/strong&gt; QNAM must be 8 characters or less, start with a letter, and contain only letters and digits. Do not rely on output truncation to save you. Validate QNAM values in the control dataset before generation.
&lt;/div&gt;

&lt;h2&gt;The Control Dataset Design&lt;/h2&gt;

&lt;p&gt;
The control dataset is the single source of truth for SUPPQUAL generation. Every QNAM you want to produce corresponds to one row. The control dataset should be maintained like any other programming spec and version-controlled with the study codebase.
&lt;/p&gt;

&lt;p&gt;
Below is a practical variable structure for &lt;code&gt;SUPPQUAL_META&lt;/code&gt;.
&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Variable&lt;/th&gt;
      &lt;th&gt;Type&lt;/th&gt;
      &lt;th&gt;Purpose&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;&lt;td&gt;RDOMAIN&lt;/td&gt;&lt;td&gt;Char(2)&lt;/td&gt;&lt;td&gt;Parent domain such as AE, CM, LB, EX, DM&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;QNAM&lt;/td&gt;&lt;td&gt;Char(8)&lt;/td&gt;&lt;td&gt;Supplemental qualifier name&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;QLABEL&lt;/td&gt;&lt;td&gt;Char(40)&lt;/td&gt;&lt;td&gt;Label for QNAM&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;SRC_DS&lt;/td&gt;&lt;td&gt;Char(41)&lt;/td&gt;&lt;td&gt;Source dataset, libname.memname or memname&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;SRC_VAR&lt;/td&gt;&lt;td&gt;Char(32)&lt;/td&gt;&lt;td&gt;Source variable in SRC_DS&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;IDVAR&lt;/td&gt;&lt;td&gt;Char(8)&lt;/td&gt;&lt;td&gt;Linking variable in SRC_DS, blank for DM-style domains&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;QORIG&lt;/td&gt;&lt;td&gt;Char(200)&lt;/td&gt;&lt;td&gt;Origin value&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;QEVAL&lt;/td&gt;&lt;td&gt;Char(200)&lt;/td&gt;&lt;td&gt;Evaluator, usually blank&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;SRC_ISNUM&lt;/td&gt;&lt;td&gt;Char(1)&lt;/td&gt;&lt;td&gt;Y if source variable is numeric, N if character&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;SRC_FMT&lt;/td&gt;&lt;td&gt;Char(32)&lt;/td&gt;&lt;td&gt;Optional numeric format for QVAL conversion&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;ACTIVATE&lt;/td&gt;&lt;td&gt;Char(1)&lt;/td&gt;&lt;td&gt;Y to include, N to skip&lt;/td&gt;&lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;
A simple example:
&lt;/p&gt;

&lt;pre&gt;
/* Example: SUPPQUAL_META contents */
data SUPPQUAL_META;
  length rdomain $2 qnam $8 qlabel $40 src_ds $41 src_var $32
         idvar $8 qorig $200 qeval $200 src_isnum $1 src_fmt $32
         activate $1;
  infile datalines dlm=&#39;|&#39; truncover;
  input rdomain $ qnam $ qlabel $ src_ds $ src_var $
        idvar $ qorig $ qeval $ src_isnum $ src_fmt $ activate $;
datalines;
AE|AECNTRY|Country Where Event Occurred|WORK.AE_WORK|AECNTRY|AESEQ|CRF||N||Y
AE|AEREL2|Causality Second Study Drug|WORK.AE_WORK|CAUSDG2|AESEQ|CRF||N||Y
AE|AESLIFE|Resulted in Life Threatening|WORK.AE_WORK|SLIFE|AESEQ|CRF||N||Y
CM|CMDOSU2|Secondary Dose Unit|WORK.CM_WORK|DOSU2|CMSEQ|CRF||N||Y
CM|CMROUTE2|Secondary Route|WORK.CM_WORK|ROUTE2|CMSEQ|CRF||N||Y
DM|DMHISP|Hispanic or Latino|WORK.DM_WORK|HISPANIC||CRF||N||Y
EX|EXLOC|Location of Administration|WORK.EX_WORK|LOCATION|EXSEQ|CRF||N||Y
LB|LBANMETH|Analysis Method|WORK.LB_WORK|ANMETH|LBSEQ|CRF||N||Y
LB|LBDSTRES|Derived Result Char|WORK.LB_WORK|DSTR|LBSEQ|Derived||N||Y
VS|VSMETHOD|Method of Measurement|WORK.VS_WORK|METHOD|VSSEQ|CRF||N||N
;
run;
&lt;/pre&gt;

&lt;div class=&quot;note&quot;&gt;
&lt;strong&gt;DM handling:&lt;/strong&gt; Because DM has one record per subject, SUPPDM should have blank IDVAR and IDVARVAL. That is valid and should not be treated as missing metadata.
&lt;/div&gt;

&lt;h2&gt;The Macro Architecture&lt;/h2&gt;

&lt;p&gt;
This design uses two macros. A helper macro, &lt;code&gt;%_supp_one&lt;/code&gt;, handles one QNAM row at a time. A driver macro, &lt;code&gt;%gen_suppqual&lt;/code&gt;, reads the control dataset, calls the helper for each active row, stacks the results, and writes one SUPP dataset per parent domain.
&lt;/p&gt;

&lt;p&gt;
That split matters. The helper is easier to test in isolation. The driver handles orchestration without mixing in domain-level data logic.
&lt;/p&gt;

&lt;div class=&quot;note&quot;&gt;
&lt;strong&gt;Assumption:&lt;/strong&gt; Every source dataset listed in &lt;code&gt;SRC_DS&lt;/code&gt; must already contain &lt;code&gt;STUDYID&lt;/code&gt; and &lt;code&gt;USUBJID&lt;/code&gt;. Do not point this macro directly at raw collection datasets that have not been standardized.
&lt;/div&gt;

&lt;div class=&quot;warn&quot;&gt;
&lt;strong&gt;Truncation risk:&lt;/strong&gt; &lt;code&gt;QLABEL&lt;/code&gt; is defined as $40 and &lt;code&gt;QVAL&lt;/code&gt; as $200. SAS truncates silently on assignment when the incoming value is longer. If your source variable can hold long free text, validate lengths before generation.
&lt;/div&gt;

&lt;h3&gt;Helper Macro: %_supp_one&lt;/h3&gt;

&lt;pre&gt;
/* ============================================================
   %_SUPP_ONE
   Internal helper. Processes one row from SUPPQUAL_META.
   Creates WORK._SUPP_&amp;i. as output.
   ============================================================ */
%macro _supp_one(
  i  = ,
  rd = ,
  qn = ,
  ql = ,
  sd = ,
  sv = ,
  iv = ,
  qo = CRF,
  qe = ,
  sn = N,
  sf =
);

  %local _dsid_ _varid_ _ivid_ _ivtype_ _qn_len_;

  /* 1. Validate source dataset exists */
  %let _dsid_ = %sysfunc(open(&amp;sd.));
  %if &amp;_dsid_. = 0 %then %do;
    %put WARNING: [_supp_one] Dataset &amp;sd. not found. Skipping QNAM=&amp;qn. (RDOMAIN=&amp;rd.);
    %return;
  %end;

  /* 2. Validate source variable exists */
  %let _varid_ = %sysfunc(varnum(&amp;_dsid_., &amp;sv.));
  %if &amp;_varid_. = 0 %then %do;
    %let _dsid_ = %sysfunc(close(&amp;_dsid_.));
    %put WARNING: [_supp_one] Variable &amp;sv. not found in &amp;sd.. Skipping QNAM=&amp;qn. (RDOMAIN=&amp;rd.);
    %return;
  %end;

  /* 3. Validate IDVAR and detect its type */
  %let _ivtype_ = C;
  %if %length(%superq(iv)) &gt; 0 %then %do;
    %let _ivid_ = %sysfunc(varnum(&amp;_dsid_., &amp;iv.));
    %if &amp;_ivid_. = 0 %then %do;
      %let _dsid_ = %sysfunc(close(&amp;_dsid_.));
      %put WARNING: [_supp_one] IDVAR=&amp;iv. not found in &amp;sd.. Skipping QNAM=&amp;qn. (RDOMAIN=&amp;rd.);
      %return;
    %end;
    %let _ivtype_ = %sysfunc(vartype(&amp;_dsid_., &amp;_ivid_.));
  %end;

  %let _dsid_ = %sysfunc(close(&amp;_dsid_.));

  /* 4. Guard QNAM length in macro context before DATA step compile */
  %let _qn_len_ = %length(%sysfunc(strip(%upcase(&amp;qn.))));
  %if &amp;_qn_len_. &gt; 8 %then %do;
    %put WARNING: [_supp_one] QNAM=&amp;qn. is &amp;_qn_len_. characters and exceeds the 8-character limit.;
    %put WARNING: [_supp_one] Output will be truncated. Fix SUPPQUAL_META before submission.;
  %end;

  /* 5. Build one QNAM contribution dataset */
  data _supp_&amp;i.
    (keep=studyid rdomain usubjid idvar idvarval qnam qlabel qval qorig qeval);

    length
      studyid  $200
      rdomain  $2
      usubjid  $200
      idvar    $8
      idvarval $200
      qnam     $8
      qlabel   $40
      qval     $200
      qorig    $200
      qeval    $200;

    set &amp;sd. (keep=studyid usubjid
                   %if %length(%superq(iv)) &gt; 0 %then &amp;iv.;
                   &amp;sv.);

    rdomain = &quot;&amp;rd.&quot;;
    idvar   = &quot;&amp;iv.&quot;;
    qnam    = &quot;%upcase(%sysfunc(strip(&amp;qn.)))&quot;;
    qlabel  = &quot;&amp;ql.&quot;;
    qorig   = &quot;&amp;qo.&quot;;
    qeval   = &quot;&amp;qe.&quot;;

    %if %length(%superq(iv)) &gt; 0 %then %do;
      %if &amp;_ivtype_. = N %then %do;
        idvarval = strip(put(&amp;iv., best32.));
      %end;
      %else %do;
        idvarval = strip(&amp;iv.);
      %end;
    %end;
    %else %do;
      idvarval = &#39;&#39;;
    %end;

    %if %upcase(&amp;sn.) = Y %then %do;
      %if %length(%superq(sf)) &gt; 0 %then %do;
        qval = strip(put(&amp;sv., &amp;sf..));
      %end;
      %else %do;
        qval = strip(put(&amp;sv., best32.));
      %end;
    %end;
    %else %do;
      qval = strip(&amp;sv.);
    %end;

    if missing(qval) then delete;
  run;

%mend _supp_one;
&lt;/pre&gt;

&lt;p&gt;
A few details here matter. &lt;code&gt;%sysfunc(vartype())&lt;/code&gt; detects whether IDVAR is numeric or character before the DATA step runs. That matters because most sequence variables are numeric in the parent dataset but must be written as character in IDVARVAL. Using &lt;code&gt;strip(put(AESEQ, best32.))&lt;/code&gt; gives a clean unformatted value. Using &lt;code&gt;vvalue()&lt;/code&gt; would return the formatted value and can create the wrong result when custom formats exist.
&lt;/p&gt;

&lt;p&gt;
The QNAM length guard lives in macro context for a reason. Once QNAM is assigned to a DATA step variable declared as &lt;code&gt;$8&lt;/code&gt;, SAS truncates it immediately and silently. Any later &lt;code&gt;length(qnam)&lt;/code&gt; check in the DATA step is too late, the value has already been shortened. The macro check catches the original full control-dataset value before compilation.
&lt;/p&gt;

&lt;p&gt;
The &lt;code&gt;missing(qval)&lt;/code&gt; delete is not optional. If the qualifier has no data value, it should not be written out as a SUPP record.
&lt;/p&gt;

&lt;h3&gt;Driver Macro: %gen_suppqual&lt;/h3&gt;

&lt;pre&gt;
/* ============================================================
   %GEN_SUPPQUAL
   Driver macro. Reads SUPPQUAL_META, calls %_supp_one for
   each row, stacks results, and writes one SUPP dataset per
   parent domain to OUTLIB.
   ============================================================ */
%macro gen_suppqual(
  meta   = WORK.SUPPQUAL_META,
  outlib = WORK
);

  %local _nrows_ _i_ _any_created_ _domains_ _nd_ _d_ _dom_ _dsid_ _nobs_;

  /* 1. Validate metadata dataset exists */
  %if not %sysfunc(exist(&amp;meta.)) %then %do;
    %put ERROR: [gen_suppqual] Control dataset &amp;meta. not found. Macro aborted.;
    %return;
  %end;

  /* 2. Count active rows */
  proc sql noprint;
    select count(*) into :_nrows_ trimmed
    from &amp;meta.
    where not missing(rdomain)
      and not missing(qnam)
      and not missing(src_ds)
      and not missing(src_var)
      and upcase(coalescec(activate,&#39;Y&#39;)) = &#39;Y&#39;;
  quit;

  %if &amp;_nrows_. = 0 %then %do;
    %put WARNING: [gen_suppqual] No valid rows found in &amp;meta.. Macro aborted.;
    %return;
  %end;

  %put NOTE: [gen_suppqual] Found &amp;_nrows_. active QNAM entries to process.;

  /* 3. Load metadata into macro variable arrays */
  %do _i_ = 1 %to &amp;_nrows_.;
    %local _rd&amp;_i_. _qn&amp;_i_. _ql&amp;_i_. _sd&amp;_i_. _sv&amp;_i_.
           _iv&amp;_i_. _qo&amp;_i_. _qe&amp;_i_. _sn&amp;_i_. _sf&amp;_i_.;
  %end;

  proc sql noprint;
    select
      strip(rdomain),
      strip(qnam),
      strip(qlabel),
      strip(src_ds),
      strip(src_var),
      strip(coalescec(idvar,&#39;&#39;)),
      strip(coalescec(qorig,&#39;CRF&#39;)),
      strip(coalescec(qeval,&#39;&#39;)),
      strip(coalescec(src_isnum,&#39;N&#39;)),
      strip(coalescec(src_fmt,&#39;&#39;))
    into
      :_rd1 - :_rd&amp;_nrows_.,
      :_qn1 - :_qn&amp;_nrows_.,
      :_ql1 - :_ql&amp;_nrows_.,
      :_sd1 - :_sd&amp;_nrows_.,
      :_sv1 - :_sv&amp;_nrows_.,
      :_iv1 - :_iv&amp;_nrows_.,
      :_qo1 - :_qo&amp;_nrows_.,
      :_qe1 - :_qe&amp;_nrows_.,
      :_sn1 - :_sn&amp;_nrows_.,
      :_sf1 - :_sf&amp;_nrows_.
    from &amp;meta.
    where not missing(rdomain)
      and not missing(qnam)
      and not missing(src_ds)
      and not missing(src_var)
      and upcase(coalescec(activate,&#39;Y&#39;)) = &#39;Y&#39;
    order by rdomain, qnam;
  quit;

  /* 4. Call helper macro */
  %do _i_ = 1 %to &amp;_nrows_.;
    %_supp_one(
      i  = &amp;_i_.,
      rd = &amp;&amp;_rd&amp;_i_.,
      qn = &amp;&amp;_qn&amp;_i_.,
      ql = &amp;&amp;_ql&amp;_i_.,
      sd = &amp;&amp;_sd&amp;_i_.,
      sv = &amp;&amp;_sv&amp;_i_.,
      iv = &amp;&amp;_iv&amp;_i_.,
      qo = &amp;&amp;_qo&amp;_i_.,
      qe = &amp;&amp;_qe&amp;_i_.,
      sn = &amp;&amp;_sn&amp;_i_.,
      sf = &amp;&amp;_sf&amp;_i_.
    );
  %end;

  /* 5. Check output existence */
  %let _any_created_ = 0;
  %do _i_ = 1 %to &amp;_nrows_.;
    %if %sysfunc(exist(work._supp_&amp;_i_.)) %then %let _any_created_ = 1;
  %end;

  %if &amp;_any_created_. = 0 %then %do;
    %put WARNING: [gen_suppqual] No contribution datasets created. Check source datasets and variable names.;
    %return;
  %end;

  /* 6. Stack all contributions */
  data _supp_all_;
    set
    %do _i_ = 1 %to &amp;_nrows_.;
      %if %sysfunc(exist(work._supp_&amp;_i_.)) %then work._supp_&amp;_i_.;
    %end;
    ;
  run;

  /* 7. Get list of represented domains */
  proc sql noprint;
    select distinct strip(rdomain) into :_domains_ separated by &#39;|&#39;
    from _supp_all_
    where not missing(rdomain);

    select count(distinct rdomain) into :_nd_ trimmed
    from _supp_all_
    where not missing(rdomain);
  quit;

  /* 8. Write one SUPP dataset per domain */
  %do _d_ = 1 %to &amp;_nd_.;
    %let _dom_ = %scan(&amp;_domains_., &amp;_d_., |);

    proc sort
      data=_supp_all_(where=(rdomain=&quot;&amp;_dom_.&quot;))
      out=&amp;outlib..supp&amp;_dom_.(label=&quot;Supplemental Qualifiers for %upcase(&amp;_dom_.)&quot;);
      by studyid rdomain usubjid idvar idvarval qnam;
    run;

    %let _dsid_ = %sysfunc(open(&amp;outlib..supp&amp;_dom_.));
    %let _nobs_ = %sysfunc(attrn(&amp;_dsid_., nobs));
    %let _dsid_ = %sysfunc(close(&amp;_dsid_.));

    %put NOTE: [gen_suppqual] &amp;outlib..SUPP&amp;_dom_. created with &amp;_nobs_. observations.;
  %end;

  /* 9. Clean up */
  proc datasets lib=work nolist nowarn;
    delete _supp_:;
  quit;

  %put NOTE: [gen_suppqual] Complete. &amp;_nd_. SUPP dataset(s) written to &amp;outlib..;

%mend gen_suppqual;
&lt;/pre&gt;

&lt;h2&gt;Calling the Macro&lt;/h2&gt;

&lt;p&gt;
Once the control dataset is ready and the macros are compiled, generation becomes a single call.
&lt;/p&gt;

&lt;pre&gt;
%validate_meta(meta=SDTM.SUPPQUAL_META);
%gen_suppqual(meta=SDTM.SUPPQUAL_META, outlib=SDTM);
&lt;/pre&gt;

&lt;p&gt;
The SAS log should show one NOTE per created SUPP dataset along with the observation count. Any source-dataset or source-variable mismatches should produce clear WARNING messages that point back to the relevant QNAM and RDOMAIN.
&lt;/p&gt;

&lt;h2&gt;Handling Numeric QVAL Variables&lt;/h2&gt;

&lt;p&gt;
Most SUPPQUAL values start as character variables. But some studies need numeric source variables carried into QVAL. The combination of &lt;code&gt;SRC_ISNUM&lt;/code&gt; and &lt;code&gt;SRC_FMT&lt;/code&gt; handles that cleanly.
&lt;/p&gt;

&lt;pre&gt;
data SUPPQUAL_META;
  set SUPPQUAL_META;
  if qnam = &#39;LBCALC&#39; then do;
    src_isnum = &#39;Y&#39;;
    src_fmt   = &#39;8.3&#39;;
  end;
run;
&lt;/pre&gt;

&lt;p&gt;
If &lt;code&gt;SRC_ISNUM=Y&lt;/code&gt; and &lt;code&gt;SRC_FMT&lt;/code&gt; is blank, the macro uses &lt;code&gt;best32.&lt;/code&gt;. If you need fixed decimal places preserved in QVAL, provide an explicit numeric format in the metadata.
&lt;/p&gt;

&lt;div class=&quot;note&quot;&gt;
&lt;strong&gt;Decimal display:&lt;/strong&gt; &lt;code&gt;best32.&lt;/code&gt; will turn 2.0 into &quot;2&quot;, not &quot;2.0&quot;. If decimal presentation matters, set &lt;code&gt;SRC_FMT&lt;/code&gt; to a value such as &lt;code&gt;8.2&lt;/code&gt;.
&lt;/div&gt;

&lt;h2&gt;Pinnacle 21 and Metadata Validation&lt;/h2&gt;

&lt;p&gt;
Correct macro output is not enough. You should validate the metadata before generation and check the output after generation.
&lt;/p&gt;

&lt;p&gt;
A practical validation macro is shown below. It checks QNAM length, QNAM naming pattern, uppercase consistency, and missing required fields before the driver macro runs.
&lt;/p&gt;

&lt;pre&gt;
/* ============================================================
   %VALIDATE_META
   Purpose : pre-flight checks on the SUPPQUAL control dataset.
   Params  : meta   = control dataset to validate
             outlib = library that receives _META_ERRORS
   Run before %GEN_SUPPQUAL. Aborts (%abort cancel) on any failure.
   ============================================================ */
%macro validate_meta(meta=WORK.SUPPQUAL_META, outlib=WORK);

  /* Collect every violation into one table so all issues are
     reported in a single run rather than one at a time. */
  proc sql noprint;
    create table &amp;outlib.._meta_errors as

    /* SD0083: QNAM must fit the 8-character SAS name limit. */
    select &#39;SD0083&#39; as check_id, rdomain, qnam,
           &#39;QNAM exceeds 8 characters&#39; as description
    from &amp;meta.
    where length(strip(qnam)) &gt; 8

    union all

    /* SD0082: letter first, then letters/digits, 8 chars max. */
    select &#39;SD0082&#39; as check_id, rdomain, qnam,
           &#39;QNAM fails naming convention&#39; as description
    from &amp;meta.
    where not prxmatch(&#39;/^[A-Za-z][A-Za-z0-9]{0,7}$/&#39;, strip(qnam))

    union all

    /* Metadata hygiene: QNAM should be stored in uppercase. */
    select &#39;FORMAT&#39; as check_id, rdomain, qnam,
           &#39;QNAM contains lowercase and should be uppercase in metadata&#39; as description
    from &amp;meta.
    where qnam ne upcase(qnam)

    union all

    /* Required fields: the CASE names the first missing field
       found, checked in SRC_DS then SRC_VAR order. */
    select &#39;MISSING&#39; as check_id, rdomain, qnam,
           cats(&#39;Missing required field: &#39;,
                case
                  when missing(src_ds) then &#39;SRC_DS&#39;
                  when missing(src_var) then &#39;SRC_VAR&#39;
                  else &#39;RDOMAIN or QNAM&#39;
                end) as description
    from &amp;meta.
    where missing(src_ds)
       or missing(src_var)
       or missing(rdomain)
       or missing(qnam);
  quit;

  /* Count findings; TRIMMED strips leading blanks from the
     macro variable so the numeric compare below is clean. */
  %local _nerr_;
  proc sql noprint;
    select count(*) into :_nerr_ trimmed
    from &amp;outlib.._meta_errors;
  quit;

  /* On any finding: print the error table, then stop the
     session so generation never runs against bad metadata. */
  %if &amp;_nerr_. &gt; 0 %then %do;
    proc print data=&amp;outlib.._meta_errors noobs; run;
    %put ERROR: [validate_meta] &amp;_nerr_. validation issue(s) found in &amp;meta..;
    %put ERROR: [validate_meta] Fix metadata before calling %nrstr(%gen_suppqual).;
    %abort cancel;
  %end;
  %else %do;
    %put NOTE: [validate_meta] Metadata checks passed.;
  %end;

%mend validate_meta;
&lt;/pre&gt;

&lt;p&gt;
You should also run a post-generation duplicate check. Duplicate QNAM values for the same parent record are a common structural issue when the source working dataset has not been deduplicated correctly.
&lt;/p&gt;

&lt;pre&gt;
/* Sort to a WORK copy: without OUT=, NODUPKEY would silently
   delete duplicate records from the permanent SDTM.SUPPAE. */
proc sort data=SDTM.SUPPAE out=_supp_chk_ nodupkey dupout=_dups_;
  by studyid rdomain usubjid idvar idvarval qnam;
run;

/* %LOCAL and %IF are only valid inside a macro, so wrap the
   duplicate check in one instead of running it in open code. */
%macro check_dups(ds=_dups_, domain=SUPPAE);
  %local _dsid_ _nobs_;
  %if %sysfunc(exist(&amp;ds.)) %then %do;
    %let _dsid_ = %sysfunc(open(&amp;ds.));
    %let _nobs_ = %sysfunc(attrn(&amp;_dsid_., nobs));
    %let _dsid_ = %sysfunc(close(&amp;_dsid_.));
    %if &amp;_nobs_. &gt; 0 %then
      %put WARNING: Duplicate QNAM records found in &amp;domain.. Review source data.;
  %end;
%mend check_dups;

%check_dups();
&lt;/pre&gt;

&lt;h2&gt;Extending the Approach&lt;/h2&gt;

&lt;p&gt;
This macro assumes the source variable lives in the same dataset that already contains STUDYID, USUBJID, and the linking IDVAR. That is the cleanest pattern and the one most teams should aim for. If your supplemental variable lives elsewhere and needs to be joined back to the parent domain to obtain the sequence number, do that in a preprocessing step. Keep the SUPPQUAL macro focused on extraction and formatting, not on dataset joins.
&lt;/p&gt;

&lt;p&gt;
The same metadata can also help feed define.xml generation. QNAM, QLABEL, QORIG, and QEVAL already exist in the control dataset, so you are not maintaining the same information in two different places.
&lt;/p&gt;

&lt;h2&gt;Summary&lt;/h2&gt;

&lt;p&gt;
A metadata-driven SUPPQUAL framework does not remove complexity. It moves it into one controlled dataset where it is visible, testable, and easier to maintain. The macros shown here handle the main technical requirements: IDVAR type detection, numeric-to-character conversion, blank IDVAR handling for one-record-per-subject domains, QVAL exclusion when missing, and output sorted into submission-ready SUPP datasets.
&lt;/p&gt;

&lt;p&gt;
The biggest payoff comes when the specification changes. Adding a new QNAM becomes a metadata change, not a program rewrite.
&lt;/p&gt;

&lt;p&gt;
For regulated studies, keep these macros in a controlled and versioned macro library. Do not pull production macros from ad hoc URLs at runtime. The code used for submission work should come from a locked, auditable source.
&lt;/p&gt;

&lt;p&gt;
Source-aligned with SDTMIG v3.3 Section 8.4. Numeric-to-character IDVARVAL conversion using &lt;code&gt;best32.&lt;/code&gt; is consistent with SAS 9.4 behavior. Validation outcomes should still be checked against the Pinnacle 21 version defined in the study validation plan.
&lt;/p&gt;

&lt;hr style=&quot;margin-top:48px; border:0; border-top:1px solid #dddddd;&quot;&gt;
&lt;p class=&quot;tags&quot;&gt;
Tags: SDTM, SUPPQUAL, SAS Macros, Metadata-Driven Programming, Define.xml, Pinnacle 21, Regulatory Submissions
&lt;/p&gt;

&lt;/body&gt;
&lt;/html&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2315822260943695633/posts/default/8138307078881198116'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2315822260943695633/posts/default/8138307078881198116'/><link rel='alternate' type='text/html' href='http://studysas.blogspot.com/2026/04/dynamic-suppqual-generation-using.html' title='Dynamic SUPPQUAL Generation Using Metadata-Driven SAS Macros'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-2315822260943695633.post-2910300344123038701</id><published>2026-04-01T13:12:00.004-04:00</published><updated>2026-04-01T13:14:26.520-04:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="Automate QC Checks"/><category scheme="http://www.blogger.com/atom/ns#" term="SDTM QC Checks"/><title type='text'>Cross-domain SDTM QC Checks You Should Automate, With SAS Snippets</title><content type='html'>&lt;body&gt;

&lt;p class=&quot;post-meta&quot;&gt;SDTM Programming &amp;#183; SAS &amp;#183; Submission QC &amp;#183; Pinnacle 21 &amp;#183; Define.xml&lt;/p&gt;

&lt;h1&gt;Cross-Domain SDTM QC Checks You Should Automate&lt;/h1&gt;

&lt;p&gt;Most SDTM QC still stops at domain‑level review and Pinnacle 21 output.&lt;/p&gt;

&lt;p&gt;That leaves a big gap.&lt;/p&gt;

&lt;p&gt;A study can be structurally clean and still fail basic cross‑domain logic. AE timing can conflict with EX. Death can exist in DM without a matching DS record. RFSTDTC can disagree with the earliest exposure date. None of that is rare. None of it should be left to manual review.&lt;/p&gt;

&lt;p&gt;If you want stronger SDTM, automate the checks that validate how domains work together, not just whether each domain looks correct in isolation.&lt;/p&gt;

&lt;p&gt;These are not meant to replace protocol review, medical review, or P21. They are meant to catch the quiet, cross‑domain failures that sit between them.&lt;/p&gt;

&lt;hr&gt;

&lt;h2&gt;Why Cross-domain QC Matters&lt;/h2&gt;

&lt;p&gt;P21 validates conformance.&lt;/p&gt;
&lt;p&gt;Cross‑domain QC validates coherence.&lt;/p&gt;

&lt;p&gt;That is the difference between:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;a dataset that follows SDTM rules&lt;/li&gt;
  &lt;li&gt;and a dataset that actually represents the study correctly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The most expensive issues in submission work usually live in that second category.&lt;/p&gt;

&lt;div class=&quot;callout&quot;&gt;
These are not edge cases. This is where most real submission issues live.
&lt;/div&gt;

&lt;hr&gt;

&lt;h2&gt;A Good Rule Before Writing Any Check&lt;/h2&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Chronology&lt;/strong&gt;: Did events happen in a clinically possible order?&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Consistency&lt;/strong&gt;: Do anchor variables agree across domains?&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Completeness&lt;/strong&gt;: Is a required partner record present somewhere else?&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Traceability&lt;/strong&gt;: Does a claimed relationship actually resolve?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If a check does not answer one of those questions, it is probably better handled elsewhere.&lt;/p&gt;

&lt;hr&gt;

&lt;h2&gt;1. AE Start Date Before First Exposure&lt;/h2&gt;

&lt;p&gt;This is one of the most useful cross‑domain checks. An AE that starts before first exposure is not always wrong — it may reflect pre‑treatment conditions or prior medical history that the protocol is capturing — but it should always be flagged and reviewed.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;proc sql;
  create table ex_first as
  select usubjid,
         min(input(exstdtc, ?? is8601da.)) as first_exdt format=date9.
  from sdtm.ex
  where not missing(exstdtc)
  group by usubjid;
quit;

/* Require a non-missing converted AE date: a partial or missing
   AESTDTC converts to a missing SAS date, and a missing date
   compares lower than any real date, which would flag those
   records as false positives. */
proc sql;
  create table qc_ae_before_ex as
  select a.usubjid, a.aeseq, a.aestdtc,
         input(a.aestdtc, ?? is8601da.) as aestdt format=date9.,
         e.first_exdt
  from sdtm.ae as a
  left join ex_first as e
    on a.usubjid = e.usubjid
  where not missing(calculated aestdt)
    and calculated aestdt &amp;lt; e.first_exdt;
quit;&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Issues that show up here often include bad date mapping, wrong reference dates, or AE‑related timing logic that clashes with protocol‑defined treatment‑emergent assumptions.&lt;/p&gt;

&lt;hr&gt;

&lt;h2&gt;2. DM RFSTDTC vs Earliest EXSTDTC&lt;/h2&gt;

&lt;p&gt;If &lt;code&gt;RFSTDTC&lt;/code&gt; is the date of first study treatment, it should match the earliest exposure date in EX for that subject. When they disagree, every downstream day‑based calculation built from &lt;code&gt;RFSTDTC&lt;/code&gt; becomes suspect.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;proc sql;
  create table ex_first as
  select usubjid,
         min(input(exstdtc, ?? is8601da.)) as first_exdt format=date9.
  from sdtm.ex
  where not missing(exstdtc)
  group by usubjid;
quit;

/* Land the converted RFSTDTC in a distinctly named dataset.
   Creating qc_dm_rfstdtc_mismatch from itself in PROC SQL
   recursively references the target table, which raises a
   data-integrity warning in SAS. */
data dm_rf;
  set sdtm.dm;
  rfstdt = input(rfstdtc, ?? is8601da.);
run;

proc sql;
  create table qc_dm_rfstdtc_mismatch as
  select d.usubjid, d.rfstdtc, e.first_exdt format=date9.
  from dm_rf as d
  left join ex_first as e
    on d.usubjid = e.usubjid
  where not missing(d.rfstdt) and not missing(e.first_exdt)
    and d.rfstdt ne e.first_exdt;
quit;&lt;/code&gt;&lt;/pre&gt;

&lt;hr&gt;

&lt;h2&gt;3. DM Death Without DS Record&lt;/h2&gt;

&lt;p&gt;If &lt;code&gt;DTHFL = &quot;Y&quot;&lt;/code&gt; or &lt;code&gt;DTHDTC&lt;/code&gt; is populated in DM, there should usually be a corresponding DS record that reflects the subject’s disposition tied to death. A mismatch undermines the death-disposition narrative for the reviewer.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;/* Flag deaths recorded in DM — death flag set OR death date
   populated — that have no DS record at all. A stricter variant
   would also require a death-related DSDECOD on the DS side. */
proc sql;
  create table qc_dm_death_no_ds as
  select d.usubjid
  from sdtm.dm as d
  left join sdtm.ds as s
    on d.usubjid = s.usubjid
  where (upcase(d.dthfl) = &quot;Y&quot; or not missing(d.dthdtc))
    and missing(s.usubjid);
quit;&lt;/code&gt;&lt;/pre&gt;

&lt;hr&gt;

&lt;h2&gt;4. LB Outside Study Window&lt;/h2&gt;

&lt;p&gt;Lab records can look fine in isolation, but they become problematic when placed against the subject’s actual study window from DM (&lt;code&gt;RFSTDTC&lt;/code&gt; to &lt;code&gt;RFENDTC&lt;/code&gt;). This check catches labs that are genuinely out‑of‑window, mis‑mapped, or pulled into SDTM by mistake.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;proc sql;
  create table dm_window as
  select usubjid,
         input(rfstdtc, ?? is8601da.) as rfstdt format=date9.,
         input(rfendtc, ?? is8601da.) as rfendt format=date9.
  from sdtm.dm;
quit;

/* Require a non-missing converted LB date: a partial or missing
   LBDTC converts to a missing SAS date, which compares lower
   than RFSTDT and would otherwise create false positives. The
   RFSTDT side is also guarded so subjects without a usable
   reference start date are not flagged spuriously. */
proc sql;
  create table qc_lb_outside_window as
  select l.usubjid, l.lbseq, l.lbdtc,
         input(l.lbdtc, ?? is8601da.) as lbdt format=date9.,
         d.rfstdt, d.rfendt
  from sdtm.lb as l
  left join dm_window as d
    on l.usubjid = d.usubjid
  where not missing(calculated lbdt)
    and (
         (not missing(d.rfstdt) and calculated lbdt &amp;lt; d.rfstdt)
         or
         (not missing(d.rfendt) and calculated lbdt &amp;gt; d.rfendt)
        );
quit;&lt;/code&gt;&lt;/pre&gt;

&lt;hr&gt;

&lt;h2&gt;5. RELREC Link Resolution&lt;/h2&gt;

&lt;p&gt;A RELREC record is only useful if the referenced record actually exists. This is a classic cross‑domain traceability check. Broken RELREC links create false confidence rather than real traceability. This is one of the few checks where structure can be completely correct and still be functionally useless.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;/* Verify that each RELREC reference into the given domain
   resolves to a real record. Assumes IDVAR holds a numeric key
   (e.g. --SEQ); the ?? modifier suppresses data-error notes in
   the log when IDVARVAL is not a valid number. */
%macro check_relrec(domain=, idvar=);
  proc sql;
    create table qc_relrec_&amp;amp;domain as
    select r.usubjid, r.rdomain, r.idvar, r.idvarval
    from sdtm.relrec as r
    left join sdtm.&amp;amp;domain as d
      on r.usubjid = d.usubjid
     and input(r.idvarval, ?? best.) = d.&amp;amp;idvar
    where upcase(r.rdomain) = &quot;%upcase(&amp;amp;domain)&quot;
      and upcase(r.idvar) = &quot;%upcase(&amp;amp;idvar)&quot;
      and missing(d.usubjid);
  quit;
%mend;

%check_relrec(domain=ae, idvar=aeseq);
%check_relrec(domain=cm, idvar=cmseq);&lt;/code&gt;&lt;/pre&gt;

&lt;hr&gt;

&lt;h2&gt;6. VLM Coverage Gaps&lt;/h2&gt;

&lt;p&gt;If a value appears in the data but has no matching value-level metadata entry, that is a real documentation gap, even when the define.xml is technically valid.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;/* Compare QNAM values present in the data against the QNAM
   values documented in value-level metadata (VLM). */
proc sql;
  /* Distinct QNAMs actually submitted in SUPPAE. */
  create table suppae_qnam_data as
  select distinct upcase(qnam) as qnam
  from sdtm.suppae;

  /* Distinct QNAMs documented in the VLM metadata table. */
  create table suppae_qnam_meta as
  select distinct upcase(value) as qnam
  from meta.vlm
  where upcase(dataset) = &quot;SUPPAE&quot;
    and upcase(variable) = &quot;QNAM&quot;;

  /* Data values with no VLM entry: real documentation gaps,
     even when define.xml is schema-valid. */
  create table qc_suppae_qnam_vlm_gap as
  select d.qnam
  from suppae_qnam_data as d
  left join suppae_qnam_meta as m
    on d.qnam = m.qnam
  where missing(m.qnam);
quit;&lt;/code&gt;&lt;/pre&gt;

&lt;hr&gt;

&lt;h2&gt;Building These Checks Into a Reusable SAS QC Framework&lt;/h2&gt;

&lt;p&gt;These checks are most useful when packaged as a shared macro library and QC findings layer used across studies, not as one‑off snippets in individual program folders.&lt;/p&gt;

&lt;p&gt;You can build them as:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;standard macros (e.g., &lt;code&gt;%check_relrec&lt;/code&gt;, &lt;code&gt;%check_ae_ex_timing&lt;/code&gt;)&lt;/li&gt;
  &lt;li&gt;standard QC datasets (one per check)&lt;/li&gt;
  &lt;li&gt;a standard reporting layer that appends all findings into a single QC findings dataset&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once that structure is in place, each new study can reuse the same logic with little customization.&lt;/p&gt;

&lt;hr&gt;

&lt;h2&gt;What You Should Automate First&lt;/h2&gt;

&lt;p&gt;If you are starting from scratch, automate these first:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;DM‑to‑EX date anchor checks (RFSTDTC vs earliest EXSTDTC)&lt;/li&gt;
  &lt;li&gt;AE‑to‑EX chronology (AE before first exposure)&lt;/li&gt;
  &lt;li&gt;DM‑to‑DS death consistency&lt;/li&gt;
  &lt;li&gt;LB‑to‑DM timeline (LB outside study window)&lt;/li&gt;
  &lt;li&gt;RELREC link resolution&lt;/li&gt;
  &lt;li&gt;VLM coverage for key variables&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These give you the fastest return on investment and form the backbone of a cross‑domain QC framework.&lt;/p&gt;

&lt;hr&gt;

&lt;h2&gt;What Automation Will Not Solve&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;mapping correctness&lt;/li&gt;
  &lt;li&gt;protocol interpretation&lt;/li&gt;
  &lt;li&gt;clinical judgment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Automation cannot replace those decisions. It can, however, force them into the open early by surfacing the inconsistencies they create.&lt;/p&gt;

&lt;hr&gt;

&lt;h2&gt;Bottom Line&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Clean domains are not enough. The domains have to agree with each other.&lt;/strong&gt;&lt;/p&gt;

&lt;div class=&quot;closing&quot;&gt;
Cross‑domain QC is where SDTM moves from compliant to credible.
&lt;/div&gt;

&lt;/body&gt;
```</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2315822260943695633/posts/default/2910300344123038701'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2315822260943695633/posts/default/2910300344123038701'/><link rel='alternate' type='text/html' href='http://studysas.blogspot.com/2026/04/html-sdtm-programming-sas-submission-qc.html' title='Cross-domain SDTM QC Checks You Should Automate, With SAS Snippets'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-2315822260943695633.post-8666477164473217878</id><published>2026-04-01T12:11:00.003-04:00</published><updated>2026-04-24T07:22:03.194-04:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="P21 Enterprise Issues"/><category scheme="http://www.blogger.com/atom/ns#" term="SDTM Validation"/><title type='text'>The Illusion of P21 Clean: Why Passing Validation Is Not Enough</title><content type='html'>&lt;style&gt;
  .post-body-wrap {
    font-family: Georgia, &#39;Times New Roman&#39;, serif;
    font-size: 17px;
    line-height: 1.9;
    color: #1a1a1a;
  }
  .post-body-wrap .post-meta {
    font-family: &#39;Courier New&#39;, monospace;
    font-size: 0.78em;
    color: #999;
    margin-bottom: 38px;
    letter-spacing: 0.05em;
    text-transform: uppercase;
  }
  .post-body-wrap h2 {
    font-size: 1.3em;
    font-weight: 700;
    margin-top: 56px;
    margin-bottom: 12px;
    color: #111;
    border-left: 4px solid #c0392b;
    padding-left: 14px;
  }
  .post-body-wrap h3 {
    font-size: 1.05em;
    font-weight: 700;
    margin-top: 40px;
    margin-bottom: 4px;
    color: #c0392b;
  }
  .post-body-wrap p {
    margin: 0 0 20px;
  }
  .post-body-wrap ul {
    margin: 0 0 22px 0;
    padding-left: 22px;
  }
  .post-body-wrap li {
    margin-bottom: 10px;
  }
  .post-body-wrap code {
    font-family: &#39;Courier New&#39;, monospace;
    font-size: 0.86em;
    background: #f5f5f5;
    border: 1px solid #e2e2e2;
    padding: 1px 5px;
    border-radius: 3px;
    color: #c0392b;
  }
  .post-body-wrap .callout {
    background: #fafafa;
    border-left: 4px solid #2c3e50;
    padding: 18px 24px;
    margin: 34px 0;
    color: #333;
    font-size: 1.01em;
    font-style: italic;
  }
  .post-body-wrap .sanity-box {
    background: #f9f9f9;
    border: 1px solid #ddd;
    border-top: 3px solid #2c3e50;
    padding: 24px 28px;
    margin: 40px 0;
  }
  .post-body-wrap .sanity-box p {
    margin-bottom: 12px;
  }
  .post-body-wrap .sanity-box strong {
    display: block;
    font-size: 1.02em;
    margin-bottom: 10px;
    color: #111;
  }
  .post-body-wrap hr {
    border: none;
    border-top: 1px solid #e4e4e4;
    margin: 52px 0;
  }
&lt;/style&gt;

&lt;div class=&quot;post-body-wrap&quot;&gt;


&lt;p&gt;Most SDTM teams still treat a clean run in Pinnacle 21 Enterprise as the finish line. It isn&amp;#8217;t.&lt;/p&gt;

&lt;p&gt;It tells you one thing: your datasets passed a rule-based conformance check aligned to published standards such as the SDTM model, SDTM IG, controlled terminology, and define.xml schema.&lt;/p&gt;

&lt;p&gt;It does not tell you:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;if the data is clinically interpretable,&lt;/li&gt;
  &lt;li&gt;if the relationships across domains make sense,&lt;/li&gt;
  &lt;li&gt;if a reviewer can actually use the package without stopping to question it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That gap matters most in SDTM, because SDTM is the base layer of the submission. If SDTM distorts the study, everything downstream inherits that distortion, including define.xml, reviewer traceability, and the regulatory review itself.&lt;/p&gt;

&lt;p&gt;P21 clean means the package passed rules. It does not mean the package is correct.&lt;/p&gt;

&lt;p&gt;This is not a knock on Pinnacle 21 Enterprise. It is an essential tool. But essential and sufficient are not the same thing. Teams that treat them as the same end up confusing conformance with quality, and that is where avoidable submission risk starts.&lt;/p&gt;

&lt;hr&gt;

&lt;h2&gt;What P21 Actually Does&lt;/h2&gt;

&lt;p&gt;At its core, P21 is a conformance checker. It verifies that datasets and define.xml align with published CDISC standards. In practice, that means it checks things like dataset structure, required variable presence, type and length expectations, controlled terminology membership, schema validity for define.xml, and selected referential consistency such as &lt;code&gt;STUDYID&lt;/code&gt; or &lt;code&gt;USUBJID&lt;/code&gt; alignment.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Dataset structure&lt;/strong&gt;: required variables, expected data types, labels, and special-purpose domain formatting&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Controlled terminology compliance&lt;/strong&gt;: whether submitted values appear in the expected codelist&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Define.xml schema validity&lt;/strong&gt;: whether the metadata package is structurally valid under define.xml 2.0 or 2.1&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Selective referential checks&lt;/strong&gt;: subject and study identifiers, some foreign key relationships&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Metadata completeness at a structural level&lt;/strong&gt;: required datasets, SUPPQUAL structure, standard references&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That work is necessary. A P21-clean package is structurally safer than one that fails basic conformance. But the real review question is not, “Does this fit the standard?” It is, “Does this represent the study correctly?”&lt;/p&gt;

&lt;p&gt;P21 answers the first question. Reviewers care about the second.&lt;/p&gt;

&lt;hr&gt;

&lt;h2&gt;Where P21 Stays Silent&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;This is where many real submission problems live.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;1. Clinical Logic and Timeline Plausibility&lt;/h3&gt;

&lt;p&gt;P21 has no protocol awareness. It does not know that an adverse event start date before first dose may be impossible in context, or that a death flag in DM should usually line up with a death disposition record in DS, or that a 12‑week study should not show impossible subject‑level date sequences.&lt;/p&gt;

&lt;p&gt;Take a simple example. A subject has &lt;code&gt;AESTDTC = 2021-03-01&lt;/code&gt; and &lt;code&gt;EXSTDTC = 2021-04-15&lt;/code&gt;. That AE record can be structurally perfect and still raise immediate concern in review. Same for informed consent after first dose, fatal AE outcomes paired with completion disposition, or lab collection dates that sit outside the subject&amp;#8217;s actual study window.&lt;/p&gt;

&lt;p&gt;Those are not formatting failures. They are study representation failures.&lt;/p&gt;

&lt;h3&gt;2. EX Can Be Valid and Still Misrepresent Treatment&lt;/h3&gt;

&lt;p&gt;EX is where reviewers rebuild dosing history. P21 can tell you EX is structurally valid. It cannot tell you whether EX still reflects what actually happened.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Dosing interruptions flattened&lt;/strong&gt;: a two‑week treatment hold is absorbed into one continuous record&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Dose reductions collapsed&lt;/strong&gt;: multiple dosing episodes become one final‑dose record&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Administration detail lost&lt;/strong&gt;: cycle‑level summaries replace administration‑level records in settings where timing matters&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once EX is flattened, downstream review breaks. AE timing, dose intensity, and treatment relationship all become harder to reconstruct, even though the dataset still passes validation.&lt;/p&gt;

&lt;h3&gt;3. LB Often Passes While the Standardization Is Wrong&lt;/h3&gt;

&lt;p&gt;LB is one of the easiest domains to make look clean. It is also one of the easiest places to hide quiet failures.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Reference ranges mismatch standardized units&lt;/strong&gt;: &lt;code&gt;LBSTRESN&lt;/code&gt; is converted, but &lt;code&gt;LBSTNRLO&lt;/code&gt; and &lt;code&gt;LBSTNRHI&lt;/code&gt; stay in the original unit&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Cross‑vendor inconsistency&lt;/strong&gt;: the same &lt;code&gt;LBTESTCD&lt;/code&gt; is normalized differently across sites or lab vendors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Reviewers use LB heavily for safety review. If the units, ranges, or normalization logic are inconsistent, the problem is methodological, not structural. P21 will not rescue you from that.&lt;/p&gt;

&lt;h3&gt;4. Traceability and Domain Design Failures&lt;/h3&gt;

&lt;p&gt;Some of the most damaging submission issues are not dataset‑format issues at all. They are design and traceability issues.&lt;/p&gt;

&lt;p&gt;SUPPQUAL is a common example. A SUPP-- domain can be perfectly valid while still being overloaded with clinically important qualifiers that should have stayed in the parent domain. When reviewers must manually reconstruct central interpretation variables by merging supplemental qualifiers back into &lt;code&gt;AE&lt;/code&gt; or another parent domain, the design has already failed its reader.&lt;/p&gt;

&lt;p&gt;The same thing happens when mapping intent and mapping outcome drift apart. A procedure ends up modeled like an event in the wrong class. A topic variable holds the wrong kind of concept. The record is valid in shape but wrong in meaning. Reviewers do not experience that as a standards issue. They experience it as untrustworthy data.&lt;/p&gt;

&lt;p&gt;RELREC has the same risk. It may be structurally sound while still pointing to nonexistent records, incomplete relationships, or clinically meaningless links. A technically valid relationship structure that nobody can follow is not doing its job.&lt;/p&gt;

&lt;h3&gt;5. Trial Design and Define.xml Coverage Gaps&lt;/h3&gt;

&lt;p&gt;Trial design domains and define.xml often look better in validation output than they do in actual review.&lt;/p&gt;

&lt;p&gt;P21 can confirm that &lt;code&gt;TA&lt;/code&gt;, &lt;code&gt;TE&lt;/code&gt;, &lt;code&gt;TV&lt;/code&gt;, &lt;code&gt;TI&lt;/code&gt;, and &lt;code&gt;TS&lt;/code&gt; have the right structure. It cannot confirm that the arm design, element order, visit structure, or epoch assumptions actually reflect the protocol and what subjects experienced.&lt;/p&gt;

&lt;p&gt;The same applies to define.xml. Schema‑valid does not mean review‑ready.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Missing value‑level metadata coverage&lt;/strong&gt;: actual &lt;code&gt;QNAM&lt;/code&gt;, &lt;code&gt;LBTESTCD&lt;/code&gt;, or &lt;code&gt;VSTESTCD&lt;/code&gt; values appear in data but not in VLM&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Weak origin documentation&lt;/strong&gt;: variables are tagged &lt;code&gt;Origin = &quot;Derived&quot;&lt;/code&gt; with no useful computational method&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Metadata that documents structure but not logic&lt;/strong&gt;: technically valid, but not enough for reviewer traceability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A define.xml that passes schema checks but does not explain what the reviewer needs to understand is still a weak define.xml.&lt;/p&gt;

&lt;h3&gt;6. Controlled Terminology Can Be Right on Paper and Wrong in Context&lt;/h3&gt;

&lt;p&gt;P21 checks whether a value exists in the expected codelist. It does not know whether the chosen term is the right one for the record.&lt;/p&gt;

&lt;p&gt;A &lt;code&gt;DSDECOD&lt;/code&gt; value may be codelist‑compliant and still clash with the subject&amp;#8217;s actual AE history. &lt;code&gt;AESER = &quot;Y&quot;&lt;/code&gt; may be populated without any seriousness criterion that makes the record clinically coherent. A standardized medication or medical history term can be formally allowed and still be wrong in context.&lt;/p&gt;

&lt;p&gt;That is the difference between terminology membership and clinical correctness. P21 checks one. Reviewers judge the other.&lt;/p&gt;

&lt;hr&gt;

&lt;h2&gt;What Reviewers Actually Do with SDTM&lt;/h2&gt;

&lt;p&gt;Reviewers do not think in terms of “P21 clean.” They open define.xml, move into &lt;code&gt;AE&lt;/code&gt;, &lt;code&gt;EX&lt;/code&gt;, &lt;code&gt;LB&lt;/code&gt;, &lt;code&gt;DM&lt;/code&gt;, &lt;code&gt;DS&lt;/code&gt;, and trial design domains, and try to rebuild the subject story. They ask simple questions.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Can I follow exposure history from EX?&lt;/li&gt;
  &lt;li&gt;Can I line up AEs against dosing?&lt;/li&gt;
  &lt;li&gt;Can I trust the study epochs and visit structure?&lt;/li&gt;
  &lt;li&gt;Can I move from dataset to metadata and back without confusion?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the answer is no, they raise questions even when the package is technically compliant. Review is not just a conformance exercise. It is a clinical audit.&lt;/p&gt;

&lt;p&gt;This is why P21 clean is a gate, not a verdict.&lt;/p&gt;

&lt;hr&gt;

&lt;h2&gt;The Severity Tier Problem&lt;/h2&gt;

&lt;p&gt;Even when P21 does find issues, teams often misread the severity hierarchy.&lt;/p&gt;

&lt;p&gt;In many organizations, the unwritten workflow is simple: fix Errors, selectively review Warnings, ignore Notices. That sounds practical, but the severity tiers are tied to standards language, not to what a reviewer will care about most.&lt;/p&gt;

&lt;p&gt;Some warnings that get waved through too easily:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;SD0052&lt;/strong&gt;: non‑standard variable labels that later create metadata confusion&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;SD0083&lt;/strong&gt;: variables in define.xml but not in the dataset, often a real build or metadata‑sync problem&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;SD0256 / SD0257&lt;/strong&gt;: date‑format inconsistencies that may be intentional, but still need to be explained clearly in metadata&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Notices are often treated as background noise, even though unusual value patterns, rare coded terms, and odd visit distributions are exactly the things that can point to real mapping problems.&lt;/p&gt;

&lt;div class=&quot;callout&quot;&gt;
  P21 severity reflects standards conformance logic. It does not map cleanly to reviewer concern. Treating those two hierarchies as the same is where teams get surprised.
&lt;/div&gt;

&lt;hr&gt;

&lt;h2&gt;What Robust QC Actually Looks Like&lt;/h2&gt;

&lt;p&gt;A P21‑clean package is the floor. Submission‑ready work needs another layer.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Cross‑domain temporal checks&lt;/strong&gt;: AE against EX, DS against AE outcomes, LB against DM windows, MH against subject reference dates&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;EX completeness checks&lt;/strong&gt;: separate records for interruptions, reductions, and distinct administrations where needed&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;LB consistency checks&lt;/strong&gt;: unit conversion logic, standardized ranges, site‑level and vendor‑level consistency by test&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;DM reference‑date checks&lt;/strong&gt;: &lt;code&gt;RFSTDTC&lt;/code&gt; against earliest EX, death variables against DS, reference dates in plausible order&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Define.xml coverage checks&lt;/strong&gt;: actual dataset values cross‑checked against VLM entries&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Trial design reconciliation&lt;/strong&gt;: TA, TE, TV, TI, and TS reviewed against protocol and actual study conduct&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;RELREC validation&lt;/strong&gt;: verify that linked identifiers actually exist and represent useful relationships&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And then there is the step that catches more than teams like to admit.&lt;/p&gt;

&lt;p&gt;Have one programmer act like a reviewer. Start from the cSDRG or define.xml. Follow variable origins. Rebuild a few subject‑level stories end to end. Any confusion there is not theoretical. It is a likely review problem waiting to happen.&lt;/p&gt;

&lt;hr&gt;

&lt;h2&gt;The One‑Subject Test&lt;/h2&gt;

&lt;div class=&quot;sanity-box&quot;&gt;
  &lt;strong&gt;Before submission, do this once.&lt;/strong&gt;
  &lt;p&gt;Pick one subject, ideally someone with an AE, a dose change, and at least one out‑of‑range lab result.&lt;/p&gt;
  &lt;p&gt;Now try to:&lt;/p&gt;
  &lt;ul style=&quot;margin-bottom: 14px;&quot;&gt;
    &lt;li&gt;reconstruct the full dosing history from EX alone,&lt;/li&gt;
    &lt;li&gt;align AEs against exposure using &lt;code&gt;--DY&lt;/code&gt; variables,&lt;/li&gt;
    &lt;li&gt;review LB trends with the correct normalized units and reference ranges,&lt;/li&gt;
    &lt;li&gt;confirm the study epoch from trial design and findings domains,&lt;/li&gt;
    &lt;li&gt;trace disposition through DS and confirm the reference dates in DM agree.&lt;/li&gt;
  &lt;/ul&gt;
  &lt;p&gt;If that exercise is slow, confusing, or full of workarounds, the issue is not validation. The issue is SDTM quality.&lt;/p&gt;
  &lt;p&gt;P21 will not run that test for you.&lt;/p&gt;
&lt;/div&gt;

&lt;hr&gt;

&lt;h2&gt;The Deeper Issue&lt;/h2&gt;

&lt;p&gt;The bigger problem is not the tool. The bigger problem is what teams ask the tool to stand in for.&lt;/p&gt;

&lt;p&gt;P21 was built to check conformance. It does that well. But many submission pipelines quietly promote it into a proxy for clinical consistency, metadata adequacy, and overall package quality. That is a category mistake.&lt;/p&gt;

&lt;p&gt;Conformance means the data fits the standard. Quality means the data represents the study faithfully, holds together across domains, remains traceable from source to dataset to define.xml, and can survive a competent reviewer trying to audit it.&lt;/p&gt;

&lt;p&gt;Those are not the same thing.&lt;/p&gt;

&lt;p&gt;P21 can tell you that &lt;code&gt;EPOCH&lt;/code&gt; is spelled correctly and drawn from the right codelist. It cannot tell you that &lt;code&gt;EPOCH = &quot;TREATMENT&quot;&lt;/code&gt; on a pre‑dose record is wrong. It can tell you EX is structurally valid. It cannot tell you a collapsed exposure history has erased the real dosing story. It can tell you define.xml is schema‑valid. It cannot tell you whether the reviewer can actually follow your logic.&lt;/p&gt;

&lt;p&gt;That judgment still lives with the people building the submission.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Clean SDTM passes validation. Strong SDTM survives review. Those are not the same bar.&lt;/strong&gt;&lt;/p&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2315822260943695633/posts/default/8666477164473217878'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2315822260943695633/posts/default/8666477164473217878'/><link rel='alternate' type='text/html' href='http://studysas.blogspot.com/2026/04/the-illusion-of-p21-clean-why-passing.html' title='The Illusion of P21 Clean: Why Passing Validation Is Not Enough'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-2315822260943695633.post-7106011471908909269</id><published>2026-03-30T21:45:00.005-04:00</published><updated>2026-04-03T13:25:35.914-04:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="Character Encoding"/><category scheme="http://www.blogger.com/atom/ns#" term="Japanese Text"/><category scheme="http://www.blogger.com/atom/ns#" term="SDTM Package"/><title type='text'>Character Encoding, Japanese Text, and Why Your SDTM Package Can Fail Even When the Data Logic Is Fine</title><content type='html'>&lt;div class=&quot;studysas-post&quot;&gt;
  &lt;style&gt;
    .studysas-post {
      font-family: Arial, Helvetica, sans-serif;
      color: #1f2937;
      line-height: 1.75;
      font-size: 16px;
      max-width: 900px;
      margin: 0 auto;
    }

    .studysas-hero {
      background: linear-gradient(135deg, #0f172a, #1e3a8a);
      color: #ffffff;
      padding: 36px 28px;
      border-radius: 14px;
      margin-bottom: 28px;
    }

    .studysas-kicker {
      font-size: 13px;
      letter-spacing: 0.08em;
      text-transform: uppercase;
      font-weight: 700;
      color: #c7d2fe;
      margin-bottom: 10px;
    }

    .studysas-title {
      font-size: 34px;
      line-height: 1.25;
      font-weight: 700;
      margin: 0 0 14px 0;
      color: #ffffff !important;
      text-shadow: 0 2px 6px rgba(0, 0, 0, 0.35);
    }

    .highlight-main {
      color: #60a5fa;
      font-weight: 700;
    }

    .highlight-accent {
      color: #93c5fd;
    }

    .studysas-subtitle {
      font-size: 17px;
      color: #dbeafe;
      margin: 0;
    }

    .studysas-card {
      background: #ffffff;
      border: 1px solid #e5e7eb;
      border-radius: 14px;
      padding: 26px 24px;
      margin: 22px 0;
      box-shadow: 0 2px 10px rgba(15, 23, 42, 0.04);
    }

    .studysas-card h2 {
      font-size: 25px;
      line-height: 1.3;
      margin: 0 0 14px 0;
      color: #0f172a;
    }

    .studysas-card h3 {
      font-size: 20px;
      line-height: 1.4;
      margin: 24px 0 10px 0;
      color: #1e3a8a;
    }

    .studysas-card p {
      margin: 0 0 16px 0;
    }

    .studysas-card ul {
      margin: 10px 0 16px 22px;
      padding: 0;
    }

    .studysas-card li {
      margin-bottom: 8px;
    }

    .studysas-callout {
      background: #eff6ff;
      border-left: 5px solid #2563eb;
      padding: 16px 18px;
      border-radius: 10px;
      margin: 18px 0;
      color: #1e3a8a;
      font-weight: 600;
    }

    .studysas-warning {
      background: #fff7ed;
      border-left: 5px solid #ea580c;
      padding: 16px 18px;
      border-radius: 10px;
      margin: 18px 0;
      color: #9a3412;
      font-weight: 600;
    }

    .studysas-note {
      background: #f8fafc;
      border: 1px solid #e2e8f0;
      padding: 16px 18px;
      border-radius: 10px;
      margin: 18px 0;
    }

    .studysas-code {
      background: #0f172a;
      color: #e2e8f0;
      border-radius: 12px;
      padding: 18px;
      overflow-x: auto;
      font-family: Consolas, Monaco, monospace;
      font-size: 14px;
      line-height: 1.6;
      margin: 18px 0;
      white-space: pre-wrap;
    }

    .studysas-section-label {
      display: inline-block;
      background: #dbeafe;
      color: #1d4ed8;
      font-size: 12px;
      font-weight: 700;
      letter-spacing: 0.04em;
      text-transform: uppercase;
      padding: 6px 10px;
      border-radius: 999px;
      margin-bottom: 12px;
    }

    .studysas-closing {
      background: #0f172a;
      color: #f8fafc;
      border-radius: 14px;
      padding: 24px;
      margin-top: 24px;
    }

    .studysas-closing h2 {
      color: #ffffff;
      margin-top: 0;
    }

    .studysas-small {
      font-size: 14px;
      color: #475569;
    }

    .studysas-sources ul {
      margin-top: 8px;
    }

    .studysas-sources a {
      color: #1d4ed8;
      text-decoration: none;
    }

    .studysas-sources a:hover {
      text-decoration: underline;
    }

    @media (max-width: 768px) {
      .studysas-title {
        font-size: 28px;
      }

      .studysas-card {
        padding: 20px 18px;
      }

      .studysas-hero {
        padding: 28px 20px;
      }
    }
  &lt;/style&gt;


  &lt;div class=&quot;studysas-card&quot;&gt;
    &lt;p&gt;Your SDTM derivations are correct.&lt;br&gt;
    Your P21 run is clean.&lt;br&gt;
    Your define.xml opens and looks fine.&lt;/p&gt;

    &lt;p&gt;And yet, the package still trips up in review.&lt;/p&gt;

    &lt;p&gt;Not because of the data.&lt;br&gt;
    Because of encoding.&lt;/p&gt;
  &lt;/div&gt;

  &lt;div class=&quot;studysas-card&quot;&gt;
    &lt;div class=&quot;studysas-section-label&quot;&gt;PMDA expectation&lt;/div&gt;
    &lt;h2&gt;What PMDA expects, and why it trips teams&lt;/h2&gt;

    &lt;p&gt;PMDA’s Technical Conformance Guide states that if languages other than English are used, including Japanese, the character set and encoding scheme must be documented in the reviewer’s guide.&lt;/p&gt;

    &lt;p class=&quot;studysas-small&quot;&gt;&lt;em&gt;Source: PMDA Technical Conformance Guide on Electronic Study Data Submissions, April 2024&lt;/em&gt;&lt;/p&gt;

    &lt;p&gt;This is not a footnote. It shows up in real submissions when XML, metadata, or reviewer tools fail to render text consistently.&lt;/p&gt;
  &lt;/div&gt;

  &lt;div class=&quot;studysas-card&quot;&gt;
    &lt;div class=&quot;studysas-section-label&quot;&gt;Common misunderstanding&lt;/div&gt;
    &lt;h2&gt;The key misunderstanding&lt;/h2&gt;

    &lt;p&gt;Many teams assume:&lt;/p&gt;

    &lt;div class=&quot;studysas-callout&quot;&gt;
      “We’ll just use ASCII.”
    &lt;/div&gt;

    &lt;p&gt;The actual expectation is:&lt;/p&gt;

    &lt;ul&gt;
      &lt;li&gt;Use Unicode, typically UTF-8, as the working encoding&lt;/li&gt;
      &lt;li&gt;Restrict dataset content to ASCII-compatible characters where required&lt;/li&gt;
    &lt;/ul&gt;

    &lt;p&gt;These are not the same thing.&lt;/p&gt;
  &lt;/div&gt;

  &lt;div class=&quot;studysas-card&quot;&gt;
    &lt;div class=&quot;studysas-section-label&quot;&gt;Root constraint&lt;/div&gt;
    &lt;h2&gt;The XPT format limitation that drives the whole problem&lt;/h2&gt;

    &lt;p&gt;SDTM datasets are submitted in &lt;strong&gt;SAS Transport v5 (XPT)&lt;/strong&gt; format.&lt;/p&gt;

    &lt;p&gt;That format:&lt;/p&gt;

    &lt;ul&gt;
      &lt;li&gt;was designed for US-ASCII exchange&lt;/li&gt;
      &lt;li&gt;has no encoding metadata in the file header&lt;/li&gt;
      &lt;li&gt;does not tell the receiver what encoding was used&lt;/li&gt;
    &lt;/ul&gt;

    &lt;div class=&quot;studysas-warning&quot;&gt;
      The receiving system has to guess the encoding.
    &lt;/div&gt;

    &lt;p&gt;When SAS opens an XPT file created in a different encoding, it attempts transcoding.&lt;/p&gt;

    &lt;p&gt;That can result in:&lt;/p&gt;

    &lt;div class=&quot;studysas-code&quot;&gt;WARNING: Some character data was lost during transcoding in the dataset.&lt;/div&gt;

    &lt;p&gt;That message does not identify the variable or the observation.&lt;/p&gt;

    &lt;p&gt;&lt;strong&gt;Your data can be corrupted silently.&lt;/strong&gt;&lt;/p&gt;
  &lt;/div&gt;

  &lt;div class=&quot;studysas-card&quot;&gt;
    &lt;div class=&quot;studysas-section-label&quot;&gt;Programming consequence&lt;/div&gt;
    &lt;h2&gt;Byte limits, not character limits&lt;/h2&gt;

    &lt;p&gt;XPT constraints are in &lt;strong&gt;bytes&lt;/strong&gt;, not characters.&lt;/p&gt;

    &lt;p&gt;For UTF-8 encoded Japanese text, one character typically uses about 3 bytes.&lt;/p&gt;

    &lt;div class=&quot;studysas-note&quot;&gt;
      &lt;strong&gt;Practical effect:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
      40-byte label → about 13 Japanese characters&lt;br&gt;
      200-byte variable → about 66 Japanese characters
    &lt;/div&gt;

    &lt;p&gt;This affects:&lt;/p&gt;

    &lt;ul&gt;
      &lt;li&gt;variable lengths&lt;/li&gt;
      &lt;li&gt;labels&lt;/li&gt;
      &lt;li&gt;define.xml metadata&lt;/li&gt;
      &lt;li&gt;macro logic that assumes character counts&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/div&gt;

  &lt;div class=&quot;studysas-card&quot;&gt;
    &lt;div class=&quot;studysas-section-label&quot;&gt;Failure points&lt;/div&gt;
    &lt;h2&gt;Where encoding actually breaks&lt;/h2&gt;

    &lt;p&gt;Encoding problems do not usually show up during mapping. They surface later:&lt;/p&gt;

    &lt;ul&gt;
      &lt;li&gt;XML rendering&lt;/li&gt;
      &lt;li&gt;stylesheet loading&lt;/li&gt;
      &lt;li&gt;reviewer-side tools&lt;/li&gt;
      &lt;li&gt;XPT read and write&lt;/li&gt;
    &lt;/ul&gt;

    &lt;p&gt;Common failure patterns include:&lt;/p&gt;

    &lt;ul&gt;
      &lt;li&gt;define.xml renders incorrectly&lt;/li&gt;
      &lt;li&gt;XML parsing fails&lt;/li&gt;
      &lt;li&gt;dataset comments become unreadable&lt;/li&gt;
      &lt;li&gt;XPT import causes truncation&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/div&gt;

  &lt;div class=&quot;studysas-card&quot;&gt;
    &lt;div class=&quot;studysas-section-label&quot;&gt;Pipeline cracks&lt;/div&gt;
    &lt;h2&gt;Where the cracks really come from&lt;/h2&gt;

    &lt;h3&gt;SAS session encoding&lt;/h3&gt;
    &lt;p&gt;If SAS is not UTF-8:&lt;/p&gt;
    &lt;ul&gt;
      &lt;li&gt;transcoding occurs&lt;/li&gt;
      &lt;li&gt;data may be altered silently&lt;/li&gt;
    &lt;/ul&gt;

    &lt;div class=&quot;studysas-callout&quot;&gt;
      Recommended setup: SAS session encoding = UTF-8
    &lt;/div&gt;

    &lt;h3&gt;XML generation&lt;/h3&gt;
    &lt;p&gt;XML requires strict consistency:&lt;/p&gt;

    &lt;div class=&quot;studysas-code&quot;&gt;&amp;lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&amp;gt;&lt;/div&gt;

    &lt;p&gt;If the declared encoding does not match the actual encoding, parsing issues follow.&lt;/p&gt;

    &lt;h3&gt;External tools&lt;/h3&gt;
    &lt;p&gt;Excel, Notepad, and XML editors often change encoding silently or introduce hidden characters.&lt;/p&gt;

    &lt;h3&gt;Manual edits&lt;/h3&gt;
    &lt;p&gt;Opening and saving XML manually can change encoding without warning.&lt;/p&gt;

    &lt;h3&gt;Copy and paste risk&lt;/h3&gt;
    &lt;p&gt;Copying from Word or email can introduce hidden non-ASCII characters.&lt;/p&gt;

    &lt;div class=&quot;studysas-note&quot;&gt;
      The non-ASCII scan macro below will surface these characters before handoff. That is why running it late in the build cycle, not just at the start, matters.
    &lt;/div&gt;
  &lt;/div&gt;

  &lt;div class=&quot;studysas-card&quot;&gt;
    &lt;div class=&quot;studysas-section-label&quot;&gt;Build-chain warning&lt;/div&gt;
    &lt;h2&gt;CPORT is not XPT&lt;/h2&gt;

    &lt;p&gt;Do not use &lt;strong&gt;PROC CPORT&lt;/strong&gt; for submission datasets.&lt;/p&gt;

    &lt;p&gt;Even if the file extension is &lt;code&gt;.xpt&lt;/code&gt;, CPORT does &lt;strong&gt;not&lt;/strong&gt; create XPT v5 format.&lt;/p&gt;

    &lt;div class=&quot;studysas-warning&quot;&gt;
      PMDA gateway cannot process CPORT files.
    &lt;/div&gt;

    &lt;p class=&quot;studysas-small&quot;&gt;&lt;em&gt;Source: Pinnacle 21 Help Center, PMDA Engine Update 2211.0&lt;/em&gt;&lt;/p&gt;
  &lt;/div&gt;

  &lt;div class=&quot;studysas-card&quot;&gt;
    &lt;div class=&quot;studysas-section-label&quot;&gt;Correct build pattern&lt;/div&gt;
    &lt;h2&gt;How to generate XPT correctly&lt;/h2&gt;

    &lt;p&gt;Use the LIBNAME XPORT engine:&lt;/p&gt;

    &lt;div class=&quot;studysas-code&quot;&gt;libname out xport &#39;/path/to/output/ae.xpt&#39;;

proc copy in=mylib out=out;
  select ae;
run;

libname out clear;&lt;/div&gt;
  &lt;/div&gt;

  &lt;div class=&quot;studysas-card&quot;&gt;
    &lt;div class=&quot;studysas-section-label&quot;&gt;Why UTF-8 wins&lt;/div&gt;
    &lt;h2&gt;Why UTF-8 is the correct approach&lt;/h2&gt;

    &lt;p&gt;UTF-8 is not just about consistency. It matches the rest of the submission pipeline:&lt;/p&gt;

    &lt;ul&gt;
      &lt;li&gt;EDC systems are typically Unicode&lt;/li&gt;
      &lt;li&gt;define.xml is XML with UTF-8 declaration&lt;/li&gt;
      &lt;li&gt;Pinnacle 21 runs in a Unicode session&lt;/li&gt;
    &lt;/ul&gt;

    &lt;div class=&quot;studysas-callout&quot;&gt;
      Using UTF-8 end to end avoids transcoding and reduces the risk of silent data loss.
    &lt;/div&gt;
  &lt;/div&gt;

  &lt;div class=&quot;studysas-card&quot;&gt;
    &lt;div class=&quot;studysas-section-label&quot;&gt;Japanese text path&lt;/div&gt;
    &lt;h2&gt;What PMDA expects when Japanese text is involved&lt;/h2&gt;

    &lt;p&gt;PMDA allows two paths.&lt;/p&gt;

    &lt;h3&gt;If translation does not lose meaning&lt;/h3&gt;
    &lt;p&gt;Submit the English-translated dataset.&lt;/p&gt;

    &lt;h3&gt;If translation would lose meaning&lt;/h3&gt;
    &lt;p&gt;Submit both:&lt;/p&gt;
    &lt;ul&gt;
      &lt;li&gt;the Japanese dataset&lt;/li&gt;
      &lt;li&gt;the English-translated version&lt;/li&gt;
    &lt;/ul&gt;

    &lt;p&gt;This is the correct alternative to simply saying “just use ASCII.”&lt;/p&gt;
  &lt;/div&gt;

  &lt;div class=&quot;studysas-card&quot;&gt;
    &lt;div class=&quot;studysas-section-label&quot;&gt;Practical SAS check&lt;/div&gt;
    &lt;h2&gt;Scan for non-ASCII characters before handoff&lt;/h2&gt;

    &lt;div class=&quot;studysas-code&quot;&gt;%macro check_nonascii(lib=, dsn=);

data non_ascii_check;
  set &amp;lib..&amp;dsn;

  array _char _character_;

  do i = 1 to dim(_char);
    if prxmatch(&#39;/[^\x00-\x7F]/&#39;, _char{i}) then do;
      dataset  = &quot;&amp;dsn&quot;;
      variable = vname(_char{i});
      value    = _char{i};
      output;
    end;
  end;

  keep dataset variable value;
run;

%mend check_nonascii;&lt;/div&gt;

    &lt;p&gt;This gives you a repeatable way to surface non-ASCII content before handoff.&lt;/p&gt;
  &lt;/div&gt;

  &lt;div class=&quot;studysas-card&quot;&gt;
    &lt;div class=&quot;studysas-section-label&quot;&gt;Reviewer guide&lt;/div&gt;
    &lt;h2&gt;What PMDA expects in practice&lt;/h2&gt;

    &lt;p&gt;PMDA expects:&lt;/p&gt;

    &lt;ul&gt;
      &lt;li&gt;encoding clearly documented&lt;/li&gt;
      &lt;li&gt;character set explained&lt;/li&gt;
      &lt;li&gt;consistency across all files&lt;/li&gt;
    &lt;/ul&gt;

    &lt;p&gt;The reviewer guide should include:&lt;/p&gt;

    &lt;ul&gt;
      &lt;li&gt;SAS session encoding&lt;/li&gt;
      &lt;li&gt;XML encoding, typically UTF-8&lt;/li&gt;
      &lt;li&gt;dataset character constraints&lt;/li&gt;
      &lt;li&gt;handling of non-English text&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/div&gt;

  &lt;div class=&quot;studysas-card&quot;&gt;
    &lt;div class=&quot;studysas-section-label&quot;&gt;Example language&lt;/div&gt;
    &lt;h2&gt;Example reviewer guide note&lt;/h2&gt;

    &lt;div class=&quot;studysas-code&quot;&gt;Character Encoding:

All datasets and metadata files were generated using UTF-8 encoding.

Dataset content is restricted to ASCII-compatible characters.

Japanese text, where required, is handled per PMDA guidance and
represented consistently across datasets and metadata.

All XML files include explicit encoding declarations.&lt;/div&gt;
  &lt;/div&gt;

  &lt;div class=&quot;studysas-closing&quot;&gt;
    &lt;h2&gt;Why this matters&lt;/h2&gt;

    &lt;p&gt;FDA workflows are mostly English, so encoding problems are less visible until something breaks downstream.&lt;/p&gt;

    &lt;p&gt;PMDA workflows often include Japanese and explicitly require encoding clarity, which makes encoding a submission risk, not just a technical detail.&lt;/p&gt;

    &lt;p&gt;&lt;strong&gt;Encoding is one of the few areas where your data can be completely correct and your submission can still fail.&lt;/strong&gt;&lt;/p&gt;
  &lt;/div&gt;

  &lt;div class=&quot;studysas-card studysas-sources&quot;&gt;
    &lt;div class=&quot;studysas-section-label&quot;&gt;Sources&lt;/div&gt;
    &lt;h2&gt;References&lt;/h2&gt;

    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;https://www.pmda.go.jp/files/000267942.pdf&quot; target=&quot;_blank&quot;&gt;PMDA Technical Conformance Guide (April 2024)&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;https://www.pmda.go.jp/english/review-services/reviews/0002.html&quot; target=&quot;_blank&quot;&gt;PMDA Electronic Study Data Review Page&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;Pinnacle 21 Help Center — PMDA Engine Update 2211.0&lt;/li&gt;
      &lt;li&gt;FDA Study Data Technical Conformance Guide&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/div&gt;
&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2315822260943695633/posts/default/7106011471908909269'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2315822260943695633/posts/default/7106011471908909269'/><link rel='alternate' type='text/html' href='http://studysas.blogspot.com/2026/03/character-encoding-japanese-text-and.html' title='Character Encoding, Japanese Text, and Why Your SDTM Package Can Fail Even When the Data Logic Is Fine'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-2315822260943695633.post-3645478680723545254</id><published>2026-03-30T21:24:00.004-04:00</published><updated>2026-04-03T13:29:00.102-04:00</updated><title type='text'> Passing Validation Isn’t Enough: What PMDA Actually Reviews in Your Submission Package </title><content type='html'>&lt;div class=&quot;studysas-post&quot;&gt;
  &lt;style&gt;
    .studysas-post {
      font-family: Arial, Helvetica, sans-serif;
      color: #1f2937;
      line-height: 1.75;
      font-size: 16px;
      max-width: 900px;
      margin: 0 auto;
    }

    .studysas-hero {
      background: linear-gradient(135deg, #0f172a, #1e3a8a);
      color: #ffffff;
      padding: 36px 28px;
      border-radius: 14px;
      margin-bottom: 28px;
    }

    .studysas-kicker {
      font-size: 13px;
      letter-spacing: 0.08em;
      text-transform: uppercase;
      font-weight: 700;
      color: #c7d2fe;
      margin-bottom: 10px;
    }

    .studysas-title {
      font-size: 34px;
      line-height: 1.25;
      font-weight: 700;
      margin: 0 0 14px 0;
      color: #ffffff !important;
      text-shadow: 0 2px 6px rgba(0, 0, 0, 0.35);
    }

    .highlight-main {
      color: #60a5fa;
      font-weight: 700;
    }

    .highlight-accent {
      color: #93c5fd;
    }

    .studysas-subtitle {
      font-size: 17px;
      color: #dbeafe;
      margin: 0;
    }

    .studysas-card {
      background: #ffffff;
      border: 1px solid #e5e7eb;
      border-radius: 14px;
      padding: 26px 24px;
      margin: 22px 0;
      box-shadow: 0 2px 10px rgba(15, 23, 42, 0.04);
    }

    .studysas-card h2 {
      font-size: 25px;
      line-height: 1.3;
      margin: 0 0 14px 0;
      color: #0f172a;
    }

    .studysas-card h3 {
      font-size: 20px;
      line-height: 1.4;
      margin: 24px 0 10px 0;
      color: #1e3a8a;
    }

    .studysas-card p {
      margin: 0 0 16px 0;
    }

    .studysas-card ul {
      margin: 10px 0 16px 22px;
      padding: 0;
    }

    .studysas-card li {
      margin-bottom: 8px;
    }

    .studysas-callout {
      background: #eff6ff;
      border-left: 5px solid #2563eb;
      padding: 16px 18px;
      border-radius: 10px;
      margin: 18px 0;
      color: #1e3a8a;
      font-weight: 600;
    }

    .studysas-warning {
      background: #fff7ed;
      border-left: 5px solid #ea580c;
      padding: 16px 18px;
      border-radius: 10px;
      margin: 18px 0;
      color: #9a3412;
      font-weight: 600;
    }

    .studysas-note {
      background: #f8fafc;
      border: 1px solid #e2e8f0;
      padding: 16px 18px;
      border-radius: 10px;
      margin: 18px 0;
    }

    .studysas-code {
      background: #0f172a;
      color: #e2e8f0;
      border-radius: 12px;
      padding: 18px;
      overflow-x: auto;
      font-family: Consolas, Monaco, monospace;
      font-size: 14px;
      line-height: 1.6;
      margin: 18px 0;
      white-space: pre-wrap;
    }

    .studysas-section-label {
      display: inline-block;
      background: #dbeafe;
      color: #1d4ed8;
      font-size: 12px;
      font-weight: 700;
      letter-spacing: 0.04em;
      text-transform: uppercase;
      padding: 6px 10px;
      border-radius: 999px;
      margin-bottom: 12px;
    }

    .studysas-closing {
      background: #0f172a;
      color: #f8fafc;
      border-radius: 14px;
      padding: 24px;
      margin-top: 24px;
    }

    .studysas-closing h2 {
      color: #ffffff;
      margin-top: 0;
    }

    .studysas-small {
      font-size: 14px;
      color: #475569;
    }

    .studysas-sources ul {
      margin-top: 8px;
    }

    .studysas-sources a {
      color: #1d4ed8;
      text-decoration: none;
    }

    .studysas-sources a:hover {
      text-decoration: underline;
    }

    @media (max-width: 768px) {
      .studysas-title {
        font-size: 28px;
      }

      .studysas-card {
        padding: 20px 18px;
      }

      .studysas-hero {
        padding: 28px 20px;
      }
    }
  &lt;/style&gt;

  &lt;div class=&quot;studysas-card&quot;&gt;
    &lt;p&gt;For PMDA, a clean validation run is only part of the story. 
    The real pressure is in the documentation layer, rule-version timing, reviewer guide detail, and how clearly the package explains itself at submission.&lt;/p&gt;

    &lt;p&gt;Most SDTM teams have a handoff checklist.&lt;/p&gt;

    &lt;p&gt;Datasets locked.&lt;br&gt;
    define.xml generated.&lt;br&gt;
    Reviewer guide drafted.&lt;br&gt;
    P21 run clean.&lt;/p&gt;

    &lt;p&gt;Done.&lt;/p&gt;

  
    &lt;p&gt;For PMDA, that checklist is not complete.&lt;/p&gt;

    &lt;p&gt;The submission package is not just what you validated.&lt;br&gt;
    It is:&lt;/p&gt;

    &lt;ul&gt;
      &lt;li&gt;how you validated&lt;/li&gt;
      &lt;li&gt;what you validated with&lt;/li&gt;
      &lt;li&gt;what changed between runs&lt;/li&gt;
      &lt;li&gt;how clearly you explained every finding that was not corrected&lt;/li&gt;
    &lt;/ul&gt;

    &lt;p&gt;That last part is where PMDA feels fundamentally different from FDA.&lt;br&gt;
    Not in the data standards. In the &lt;strong&gt;documentation standards&lt;/strong&gt;.&lt;/p&gt;
  &lt;/div&gt;

  &lt;div class=&quot;studysas-card&quot;&gt;
    &lt;div class=&quot;studysas-section-label&quot;&gt;Validation philosophy&lt;/div&gt;
    &lt;h2&gt;The core difference in validation philosophy&lt;/h2&gt;

    &lt;p&gt;Both FDA and PMDA use Pinnacle 21 Enterprise.&lt;/p&gt;

    &lt;p&gt;But they do not treat the results the same way.&lt;/p&gt;

    &lt;p&gt;FDA has deprecated severity classification. There is no formal Reject, Error, or Warning requirement driving submission acceptance. The expectation is simple:&lt;/p&gt;

    &lt;div class=&quot;studysas-callout&quot;&gt;
      Every unresolved issue must be explained.
    &lt;/div&gt;

    &lt;p&gt;PMDA is different.&lt;/p&gt;

    &lt;p&gt;PMDA maintains a strict severity hierarchy:&lt;/p&gt;

    &lt;ul&gt;
      &lt;li&gt;Reject&lt;/li&gt;
      &lt;li&gt;Error&lt;/li&gt;
      &lt;li&gt;Warning&lt;/li&gt;
    &lt;/ul&gt;

    &lt;p&gt;This is not cosmetic.&lt;/p&gt;

    &lt;div class=&quot;studysas-warning&quot;&gt;
      Reject-level findings stop review.
    &lt;/div&gt;

    &lt;p&gt;PMDA will not begin or continue review when:&lt;/p&gt;

    &lt;ul&gt;
      &lt;li&gt;Reject issues are present&lt;/li&gt;
      &lt;li&gt;validation cannot run because files are corrupt or malformed&lt;/li&gt;
    &lt;/ul&gt;

    &lt;p&gt;This is enforced at the gateway level.&lt;/p&gt;

    &lt;p&gt;According to Pinnacle 21 documentation, applications have been halted due to unresolved Reject findings.&lt;/p&gt;

    &lt;p class=&quot;studysas-small&quot;&gt;&lt;em&gt;Source: Pinnacle 21 Help Center, PMDA Engine Update 2211.0&lt;/em&gt;&lt;/p&gt;

    &lt;p&gt;For programmers, this changes your priority:&lt;/p&gt;

    &lt;div class=&quot;studysas-callout&quot;&gt;
      Reject findings must be fully resolved before handoff. Everything else comes after.
    &lt;/div&gt;
  &lt;/div&gt;

  &lt;div class=&quot;studysas-card&quot;&gt;
    &lt;div class=&quot;studysas-section-label&quot;&gt;Engine version control&lt;/div&gt;
    &lt;h2&gt;Validation engine version is submission metadata, not background context&lt;/h2&gt;

    &lt;p&gt;For FDA, the reviewer guide typically records:&lt;/p&gt;

    &lt;ul&gt;
      &lt;li&gt;tool used&lt;/li&gt;
      &lt;li&gt;issues remaining&lt;/li&gt;
    &lt;/ul&gt;

    &lt;p&gt;That is usually enough.&lt;/p&gt;

    &lt;p&gt;For PMDA, the validation engine version is part of the submission record.&lt;/p&gt;

    &lt;p&gt;PMDA publishes acceptable rule versions tied to submission windows.&lt;/p&gt;

    &lt;div class=&quot;studysas-note&quot;&gt;
      &lt;strong&gt;Practical view:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
      Version 3.0 → PMDA 2010.2 → Jan 2022 – Mar 2025&lt;br&gt;
      Version 4.0 → PMDA 2211.1 → Apr 2023 – Mar 2026&lt;br&gt;
      Version 5.0 → PMDA 2311.0 → Apr 2024 – Mar 2027&lt;br&gt;
      Version 6.0 → PMDA 2411.0 → Apr 2025 onward
    &lt;/div&gt;

    &lt;p class=&quot;studysas-small&quot;&gt;
      Source: &lt;a href=&quot;https://www.pmda.go.jp/english/review-services/reviews/0002.html&quot; target=&quot;_blank&quot;&gt;PMDA Electronic Data Review Page&lt;/a&gt;
    &lt;/p&gt;

    &lt;p&gt;PMDA runs your package through its own validation environment and compares results against your reviewer guide.&lt;/p&gt;

    &lt;p&gt;If your engine and PMDA’s engine differ:&lt;/p&gt;

    &lt;ul&gt;
      &lt;li&gt;findings differ&lt;/li&gt;
      &lt;li&gt;explanations don’t match&lt;/li&gt;
      &lt;li&gt;queries follow&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/div&gt;

  &lt;div class=&quot;studysas-card&quot;&gt;
    &lt;div class=&quot;studysas-section-label&quot;&gt;Submission timing&lt;/div&gt;
    &lt;h2&gt;The rule-version problem most teams miss&lt;/h2&gt;

    &lt;p&gt;Validation is tied to submission timing.&lt;/p&gt;

    &lt;p&gt;A common scenario:&lt;/p&gt;

    &lt;ul&gt;
      &lt;li&gt;study validated months earlier&lt;/li&gt;
      &lt;li&gt;submission delayed&lt;/li&gt;
      &lt;li&gt;new engine becomes current&lt;/li&gt;
      &lt;li&gt;new findings appear&lt;/li&gt;
    &lt;/ul&gt;

    &lt;p&gt;Now your reviewer guide no longer reflects what PMDA sees.&lt;/p&gt;

    &lt;p&gt;PMDA states:&lt;/p&gt;

    &lt;ul&gt;
      &lt;li&gt;submission validation uses the &lt;strong&gt;current acceptable rule set&lt;/strong&gt;&lt;/li&gt;
      &lt;li&gt;follow-up data may use the rule set active at filing&lt;/li&gt;
    &lt;/ul&gt;

    &lt;p class=&quot;studysas-small&quot;&gt;
      Source: &lt;a href=&quot;https://www.pmda.go.jp/english/review-services/reviews/0002.html&quot; target=&quot;_blank&quot;&gt;PMDA Electronic Study Data Review Page&lt;/a&gt;
    &lt;/p&gt;

    &lt;div class=&quot;studysas-warning&quot;&gt;
      If your validation engine is no longer acceptable at submission time, validation must be rerun, findings must be reassessed, and the reviewer guide must be updated.
    &lt;/div&gt;

    &lt;p&gt;This is not a documentation update.&lt;/p&gt;

    &lt;p&gt;It is a &lt;strong&gt;re-validation requirement&lt;/strong&gt;.&lt;/p&gt;
  &lt;/div&gt;

  &lt;div class=&quot;studysas-card&quot;&gt;
    &lt;div class=&quot;studysas-section-label&quot;&gt;Multi-study applications&lt;/div&gt;
    &lt;h2&gt;One engine per application&lt;/h2&gt;

    &lt;p&gt;For multi-study programs, PMDA expects one validation engine version across the entire application.&lt;/p&gt;

    &lt;p&gt;During development:&lt;/p&gt;

    &lt;ul&gt;
      &lt;li&gt;different studies may use different engines&lt;/li&gt;
    &lt;/ul&gt;

    &lt;p&gt;At submission:&lt;/p&gt;

    &lt;ul&gt;
      &lt;li&gt;everything must align&lt;/li&gt;
    &lt;/ul&gt;

    &lt;p&gt;If:&lt;/p&gt;

    &lt;ul&gt;
      &lt;li&gt;Study A → PMDA 2211.1&lt;/li&gt;
      &lt;li&gt;Study B → PMDA 2311.0&lt;/li&gt;
    &lt;/ul&gt;

    &lt;p&gt;Someone must reconcile that before submission.&lt;/p&gt;

    &lt;p&gt;It does not fix itself.&lt;/p&gt;
  &lt;/div&gt;

  &lt;div class=&quot;studysas-card&quot;&gt;
    &lt;div class=&quot;studysas-section-label&quot;&gt;Reviewer guide detail&lt;/div&gt;
    &lt;h2&gt;What the reviewer guide must actually contain&lt;/h2&gt;

    &lt;p&gt;The PMDA reviewer guide is not a summary.&lt;/p&gt;

    &lt;p&gt;It is a structured validation record.&lt;/p&gt;

    &lt;p&gt;It must include:&lt;/p&gt;

    &lt;ul&gt;
      &lt;li&gt;validation tool and version&lt;/li&gt;
      &lt;li&gt;engine version, explicitly named&lt;/li&gt;
      &lt;li&gt;rule version&lt;/li&gt;
      &lt;li&gt;for each unresolved finding:
        &lt;ul&gt;
          &lt;li&gt;rule ID&lt;/li&gt;
          &lt;li&gt;severity&lt;/li&gt;
          &lt;li&gt;justification&lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
    &lt;/ul&gt;

    &lt;div class=&quot;studysas-callout&quot;&gt;
      Reject findings must not remain.
    &lt;/div&gt;

    &lt;p&gt;If they do, you do not have a documentation issue.&lt;br&gt;
    You have a submission-blocking issue.&lt;/p&gt;
  &lt;/div&gt;

  &lt;div class=&quot;studysas-card&quot;&gt;
    &lt;div class=&quot;studysas-section-label&quot;&gt;Example language&lt;/div&gt;
    &lt;h2&gt;What “good” looks like&lt;/h2&gt;

    &lt;div class=&quot;studysas-code&quot;&gt;Validation Summary:

Initial validation performed using Pinnacle 21 Enterprise
with PMDA Engine v5.0.

Final validation performed prior to submission using PMDA Engine v6.0,
which is the acceptable version at the time of submission.

New findings identified in final validation were reviewed and assessed.

All Reject-level issues were resolved prior to submission.
Remaining findings are documented with justification in Section 6.3.&lt;/div&gt;
  &lt;/div&gt;

  &lt;div class=&quot;studysas-card&quot;&gt;
    &lt;div class=&quot;studysas-section-label&quot;&gt;Form A&lt;/div&gt;
    &lt;h2&gt;Form A is not the SDSP, and timing matters&lt;/h2&gt;

    &lt;p&gt;PMDA requires the Explanation of Electronic Study Data, Form A, but the timing rules have changed:&lt;/p&gt;

    &lt;ul&gt;
      &lt;li&gt;no longer required before initial submission for applications dated from October 2023 onward&lt;/li&gt;
      &lt;li&gt;still required for pre-NDA consultation&lt;/li&gt;
      &lt;li&gt;still required for supplemental submissions&lt;/li&gt;
      &lt;li&gt;must be updated if PMDA requests clarification&lt;/li&gt;
    &lt;/ul&gt;

    &lt;p&gt;Before handoff, confirm:&lt;/p&gt;

    &lt;ul&gt;
      &lt;li&gt;Form A is current&lt;/li&gt;
      &lt;li&gt;it aligns with the reviewer guide&lt;/li&gt;
      &lt;li&gt;it reflects the validation engine used&lt;/li&gt;
    &lt;/ul&gt;

    &lt;p&gt;Mismatch between Form A and the reviewer guide is a known source of PMDA queries.&lt;/p&gt;
  &lt;/div&gt;

  &lt;div class=&quot;studysas-card&quot;&gt;
    &lt;div class=&quot;studysas-section-label&quot;&gt;Define.xml control&lt;/div&gt;
    &lt;h2&gt;Define.xml is a validation object, not a publishing step&lt;/h2&gt;

    &lt;p&gt;PMDA independently validates:&lt;/p&gt;

    &lt;ul&gt;
      &lt;li&gt;dataset vs dataset consistency&lt;/li&gt;
      &lt;li&gt;define.xml vs dataset consistency&lt;/li&gt;
      &lt;li&gt;XML structure&lt;/li&gt;
    &lt;/ul&gt;

    &lt;p class=&quot;studysas-small&quot;&gt;
      Source: &lt;a href=&quot;https://www.pmda.go.jp/files/000267942.pdf&quot; target=&quot;_blank&quot;&gt;PMDA Technical Conformance Guide&lt;/a&gt;
    &lt;/p&gt;

    &lt;p&gt;Define.xml 1.0 is not accepted.&lt;br&gt;
    Version 2.0 or 2.1 is required.&lt;/p&gt;

    &lt;div class=&quot;studysas-warning&quot;&gt;
      If your pipeline still produces Define.xml 1.0, that is a Reject-level issue.
    &lt;/div&gt;

    &lt;p&gt;PMDA also strongly expects Analysis Results Metadata, ARM, in ADaM define.xml. This documents the link between define.xml, analysis outputs, and CDISC standards.&lt;/p&gt;

    &lt;p&gt;Define.xml must reflect:&lt;/p&gt;

    &lt;ul&gt;
      &lt;li&gt;final datasets&lt;/li&gt;
      &lt;li&gt;final derivation logic&lt;/li&gt;
      &lt;li&gt;final value-level metadata&lt;/li&gt;
    &lt;/ul&gt;

    &lt;p&gt;If it lags, your package is not ready.&lt;/p&gt;
  &lt;/div&gt;

  &lt;div class=&quot;studysas-card&quot;&gt;
    &lt;div class=&quot;studysas-section-label&quot;&gt;Pre-handoff QC&lt;/div&gt;
    &lt;h2&gt;Pre-handoff checklist for PMDA&lt;/h2&gt;

    &lt;p&gt;Before calling a package ready:&lt;/p&gt;

    &lt;ul&gt;
      &lt;li&gt;Define.xml is 2.0 or 2.1&lt;/li&gt;
      &lt;li&gt;Validation engine version is explicitly documented&lt;/li&gt;
      &lt;li&gt;Engine version is valid for the submission date&lt;/li&gt;
      &lt;li&gt;All Reject findings are resolved&lt;/li&gt;
      &lt;li&gt;Error findings are documented in the reviewer guide and Form A&lt;/li&gt;
      &lt;li&gt;Rule IDs, severity, and explanations are included&lt;/li&gt;
      &lt;li&gt;Validation logs are archived&lt;/li&gt;
      &lt;li&gt;Engine version is unified across studies&lt;/li&gt;
      &lt;li&gt;Submission date is recorded for revalidation risk&lt;/li&gt;
    &lt;/ul&gt;

    &lt;div class=&quot;studysas-callout&quot;&gt;
      The FDA checklist is shorter. That is the point.
    &lt;/div&gt;
  &lt;/div&gt;

  &lt;div class=&quot;studysas-closing&quot;&gt;
    &lt;h2&gt;The question PMDA is actually asking&lt;/h2&gt;

    &lt;p&gt;&lt;strong&gt;FDA asks:&lt;/strong&gt;&lt;br&gt;
    Did you validate your data?&lt;/p&gt;

    &lt;p&gt;&lt;strong&gt;PMDA asks:&lt;/strong&gt;&lt;br&gt;
    Can you prove how you validated, with what, and when — and is that still valid at submission?&lt;/p&gt;

    &lt;p&gt;Your handoff package is the answer.&lt;/p&gt;
  &lt;/div&gt;

  &lt;div class=&quot;studysas-card studysas-sources&quot;&gt;
    &lt;div class=&quot;studysas-section-label&quot;&gt;Sources&lt;/div&gt;
    &lt;h2&gt;References&lt;/h2&gt;

    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;https://www.pmda.go.jp/files/000267942.pdf&quot; target=&quot;_blank&quot;&gt;PMDA Technical Conformance Guide (April 2024)&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;https://www.pmda.go.jp/english/review-services/reviews/0002.html&quot; target=&quot;_blank&quot;&gt;PMDA Electronic Study Data Review Page&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;Pinnacle 21 Help Center — PMDA Engine Updates (2211.0 / 2211.1)&lt;/li&gt;
      &lt;li&gt;PhUSE EU Connect 2024, Paper SA06&lt;/li&gt;
      &lt;li&gt;PhUSE 2021, Paper EP-146&lt;/li&gt;
      &lt;li&gt;FDA Study Data Technical Conformance Guide (Oct 2024)&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/div&gt;
&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2315822260943695633/posts/default/3645478680723545254'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2315822260943695633/posts/default/3645478680723545254'/><link rel='alternate' type='text/html' href='http://studysas.blogspot.com/2026/03/for-pmda-clean-validation-run-is-only.html' title=' Passing Validation Isn’t Enough: What PMDA Actually Reviews in Your Submission Package '/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-2315822260943695633.post-7326148498474596239</id><published>2026-03-30T19:32:00.004-04:00</published><updated>2026-04-03T13:33:24.729-04:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="Define.xml"/><category scheme="http://www.blogger.com/atom/ns#" term="FDA vs PMDA Submissions"/><title type='text'>FDA vs PMDA Submissions: What Really Changes for SDTM Programmers and Define.xml Teams</title><content type='html'>&lt;div class=&quot;studysas-post&quot;&gt;
  &lt;style&gt;
    .studysas-post {
      font-family: Arial, Helvetica, sans-serif;
      color: #1f2937;
      line-height: 1.75;
      font-size: 16px;
      max-width: 900px;
      margin: 0 auto;
    }

    .studysas-hero {
      background: linear-gradient(135deg, #0f172a, #1e3a8a);
      color: #ffffff;
      padding: 36px 28px;
      border-radius: 14px;
      margin-bottom: 28px;
    }

    .studysas-kicker {
      font-size: 13px;
      letter-spacing: 0.08em;
      text-transform: uppercase;
      font-weight: 700;
      color: #c7d2fe;
      margin-bottom: 10px;
    }

    .studysas-title {
      font-size: 34px;
      line-height: 1.25;
      font-weight: 700;
      margin: 0 0 14px 0;
      color: #ffffff !important;
      text-shadow: 0 2px 6px rgba(0, 0, 0, 0.35);
    }

    .highlight-main {
      color: #60a5fa;
      font-weight: 700;
    }

    .highlight-accent {
      color: #93c5fd;
    }

    .studysas-subtitle {
      font-size: 17px;
      color: #dbeafe;
      margin: 0;
    }

    .studysas-card {
      background: #ffffff;
      border: 1px solid #e5e7eb;
      border-radius: 14px;
      padding: 26px 24px;
      margin: 22px 0;
      box-shadow: 0 2px 10px rgba(15, 23, 42, 0.04);
    }

    .studysas-card h2 {
      font-size: 25px;
      line-height: 1.3;
      margin: 0 0 14px 0;
      color: #0f172a;
    }

    .studysas-card h3 {
      font-size: 20px;
      line-height: 1.4;
      margin: 24px 0 10px 0;
      color: #1e3a8a;
    }

    .studysas-card p {
      margin: 0 0 16px 0;
    }

    .studysas-card ul {
      margin: 10px 0 16px 22px;
      padding: 0;
    }

    .studysas-card li {
      margin-bottom: 8px;
    }

    .studysas-callout {
      background: #eff6ff;
      border-left: 5px solid #2563eb;
      padding: 16px 18px;
      border-radius: 10px;
      margin: 18px 0;
      color: #1e3a8a;
      font-weight: 600;
    }

    .studysas-warning {
      background: #fff7ed;
      border-left: 5px solid #ea580c;
      padding: 16px 18px;
      border-radius: 10px;
      margin: 18px 0;
      color: #9a3412;
      font-weight: 600;
    }

    .studysas-note {
      background: #f8fafc;
      border: 1px solid #e2e8f0;
      padding: 16px 18px;
      border-radius: 10px;
      margin: 18px 0;
    }

    .studysas-table-wrap {
      overflow-x: auto;
      margin: 18px 0;
    }

    .studysas-table {
      width: 100%;
      border-collapse: collapse;
      font-size: 15px;
    }

    .studysas-table th,
    .studysas-table td {
      border: 1px solid #d1d5db;
      padding: 12px 14px;
      text-align: left;
      vertical-align: top;
    }

    .studysas-table th {
      background: #eff6ff;
      color: #1e3a8a;
      font-weight: 700;
    }

    .studysas-code {
      background: #0f172a;
      color: #e2e8f0;
      border-radius: 12px;
      padding: 18px;
      overflow-x: auto;
      font-family: Consolas, Monaco, monospace;
      font-size: 14px;
      line-height: 1.6;
      margin: 18px 0;
      white-space: pre-wrap;
    }

    .studysas-section-label {
      display: inline-block;
      background: #dbeafe;
      color: #1d4ed8;
      font-size: 12px;
      font-weight: 700;
      letter-spacing: 0.04em;
      text-transform: uppercase;
      padding: 6px 10px;
      border-radius: 999px;
      margin-bottom: 12px;
    }

    .studysas-closing {
      background: #0f172a;
      color: #f8fafc;
      border-radius: 14px;
      padding: 24px;
      margin-top: 24px;
    }

    .studysas-closing h2 {
      color: #ffffff;
      margin-top: 0;
    }

    .studysas-small {
      font-size: 14px;
      color: #475569;
    }

    .studysas-inline-strong {
      font-weight: 700;
      color: #0f172a;
    }

    @media (max-width: 768px) {
      .studysas-title {
        font-size: 28px;
      }

      .studysas-card {
        padding: 20px 18px;
      }

      .studysas-hero {
        padding: 28px 20px;
      }
    }
  &lt;/style&gt;

    &lt;p&gt;
      The real differences are usually not in domain structure. They show up in validation timing,
      metadata discipline, reviewer guide structure, rule-version control, encoding, and how clearly
      the submission package explains itself.
    &lt;/p&gt;
  &lt;/div&gt;

  &lt;div class=&quot;studysas-card&quot;&gt;
    &lt;p&gt;Most teams say they have “global submission-ready SDTM.” That usually means the datasets validate, define.xml opens, and the reviewer guides exist.&lt;/p&gt;

    &lt;p&gt;But “submission-ready” is not the same as being ready for every agency.&lt;/p&gt;

    &lt;p&gt;The FDA and PMDA overlap a lot. Both expect standardized study data. Both expect define.xml. Both run conformance checks. But the habits that work for one agency can still create extra work, or extra risk, for the other.&lt;/p&gt;

    &lt;p&gt;For senior programmers, the real difference is usually not domain structure. It shows up in how metadata is described, how validation is explained, how rule versions are tracked, how text is encoded, and how the package is documented.&lt;/p&gt;
  &lt;/div&gt;

  &lt;div class=&quot;studysas-card&quot;&gt;
    &lt;div class=&quot;studysas-section-label&quot;&gt;What stays the same&lt;/div&gt;
    &lt;h2&gt;Core SDTM discipline does not change&lt;/h2&gt;

    &lt;p&gt;At the core, your SDTM package still needs to be CDISC-conformant, traceable, and reviewable.&lt;/p&gt;

    &lt;p&gt;That part does not change between FDA and PMDA. You still need to build traceable, conformant, and reviewable data.&lt;/p&gt;

    &lt;ul&gt;
      &lt;li&gt;Consistent derivations&lt;/li&gt;
      &lt;li&gt;Stable controlled terminology handling&lt;/li&gt;
      &lt;li&gt;Clean date logic&lt;/li&gt;
      &lt;li&gt;Correct SUPP usage&lt;/li&gt;
      &lt;li&gt;Complete value-level metadata&lt;/li&gt;
      &lt;li&gt;Reviewer guides that explain what a validator alone cannot explain&lt;/li&gt;
    &lt;/ul&gt;

    &lt;div class=&quot;studysas-callout&quot;&gt;
      If your SDTM is not stable, no agency-specific packaging step is going to save you.
    &lt;/div&gt;
  &lt;/div&gt;

  &lt;div class=&quot;studysas-card&quot;&gt;
    &lt;div class=&quot;studysas-section-label&quot;&gt;Where PMDA feels different&lt;/div&gt;
    &lt;h2&gt;Documentation gets operational, not just technical&lt;/h2&gt;

    &lt;p&gt;The biggest shift is documentation depth.&lt;/p&gt;

    &lt;p&gt;PMDA expects teams to be clear about the validation setup itself, not just the final outcome. That means the reviewer guide is no longer just background text. It becomes part of the operational record of how the package was checked.&lt;/p&gt;

    &lt;ul&gt;
      &lt;li&gt;Validation tool and version&lt;/li&gt;
      &lt;li&gt;Rule version used&lt;/li&gt;
      &lt;li&gt;Explanation of findings with rule IDs&lt;/li&gt;
      &lt;li&gt;Issue handling based on PMDA severity categories&lt;/li&gt;
    &lt;/ul&gt;

    &lt;div class=&quot;studysas-callout&quot;&gt;
      &lt;strong&gt;PMDA pushes teams to demonstrate how they validated, not just that they did.&lt;/strong&gt;
    &lt;/div&gt;

    &lt;p&gt;FDA also expects reviewer guides. But PMDA makes the validation process itself much more visible. If your team validates with one engine, fixes with another, and submits after a later engine becomes current, that gap needs to be visible and defendable.&lt;/p&gt;
  &lt;/div&gt;

  &lt;div class=&quot;studysas-card&quot;&gt;
    &lt;div class=&quot;studysas-section-label&quot;&gt;Critical difference&lt;/div&gt;
    &lt;h2&gt;What can actually break your submission&lt;/h2&gt;

    &lt;p&gt;One important distinction is often underplayed. PMDA validation findings are not just documentation items. They can directly affect review acceptance.&lt;/p&gt;

    &lt;ul&gt;
      &lt;li&gt;&lt;strong&gt;PMDA uses severity levels:&lt;/strong&gt; Reject, Error, Warning&lt;/li&gt;
      &lt;li&gt;&lt;strong&gt;Reject-level findings can halt review&lt;/strong&gt; until fixed&lt;/li&gt;
      &lt;li&gt;Reviewer guide issue summaries need to reflect that structure&lt;/li&gt;
    &lt;/ul&gt;

    &lt;div class=&quot;studysas-warning&quot;&gt;
      This is not just a documentation problem. It is a submission acceptance risk.
    &lt;/div&gt;

    &lt;p&gt;FDA works differently. FDA no longer uses the same severity model in the same way, and the focus is on explanation of unresolved issues rather than a PMDA-style Reject gating model.&lt;/p&gt;

    &lt;p&gt;This changes how aggressively you fix findings before submission and how you structure the issue summary in the reviewer guide.&lt;/p&gt;
  &lt;/div&gt;

  &lt;div class=&quot;studysas-card&quot;&gt;
    &lt;div class=&quot;studysas-section-label&quot;&gt;Validation model&lt;/div&gt;
    &lt;h2&gt;FDA and PMDA do not use the same rule system&lt;/h2&gt;

    &lt;p&gt;One common mistake is treating validation as a single system. It is not.&lt;/p&gt;

    &lt;ul&gt;
      &lt;li&gt;&lt;strong&gt;FDA uses FDA Validator Rules&lt;/strong&gt;&lt;/li&gt;
      &lt;li&gt;&lt;strong&gt;PMDA uses its own published, versioned rule sets&lt;/strong&gt;&lt;/li&gt;
    &lt;/ul&gt;

    &lt;div class=&quot;studysas-callout&quot;&gt;
      You are not running one validation. You are running two different rule systems.
    &lt;/div&gt;

    &lt;p&gt;This affects which findings appear, how those findings are grouped, and what must be fixed versus explained.&lt;/p&gt;
  &lt;/div&gt;

  &lt;div class=&quot;studysas-card&quot;&gt;
    &lt;div class=&quot;studysas-section-label&quot;&gt;Validation timing&lt;/div&gt;
    &lt;h2&gt;The rule-version issue is not small&lt;/h2&gt;

    &lt;p&gt;PMDA applies the latest acceptable validation rules at submission, but follow-up data may use the rule version active when the application was filed.&lt;/p&gt;

    &lt;div class=&quot;studysas-callout&quot;&gt;
      &lt;strong&gt;Validation is not a one-time milestone.&lt;/strong&gt;
    &lt;/div&gt;

    &lt;p&gt;It is time-sensitive. For programmers, that changes behavior in a practical way.&lt;/p&gt;

    &lt;ul&gt;
      &lt;li&gt;You need a rerun close to submission&lt;/li&gt;
      &lt;li&gt;You need traceability of engine and rule versions&lt;/li&gt;
      &lt;li&gt;You need alignment between validator output and the reviewer guide&lt;/li&gt;
      &lt;li&gt;You need to check whether your planned submission date changes which engine is acceptable&lt;/li&gt;
    &lt;/ul&gt;

    &lt;div class=&quot;studysas-note&quot;&gt;
      &lt;strong&gt;Operational reality:&lt;/strong&gt; PMDA publishes acceptable validation engines and rule versions. A team should check the acceptable engine on the planned submission date, not just the engine used earlier in study closeout.
    &lt;/div&gt;

    &lt;div class=&quot;studysas-note&quot;&gt;
      &lt;strong&gt;Example:&lt;/strong&gt; A team validated with Pinnacle 21 using one PMDA engine during closeout, but the final submission happened after a newer acceptable engine became current. New rules flagged findings that did not exist earlier. The reviewer now sees issues the team never documented. That is not really a data problem. It is a submission timing problem.
    &lt;/div&gt;

    &lt;div class=&quot;studysas-warning&quot;&gt;
      If the engine used during validation is no longer acceptable at submission time, validation may need to be rerun and the reviewer guide updated.
    &lt;/div&gt;

    &lt;p&gt;Document rule versions directly in your cSDRG, and archive validation logs at each rerun point. That one step prevents a lot of avoidable review confusion.&lt;/p&gt;
  &lt;/div&gt;

  &lt;div class=&quot;studysas-card&quot;&gt;
    &lt;div class=&quot;studysas-section-label&quot;&gt;Encoding risk&lt;/div&gt;
    &lt;h2&gt;Character encoding can become a late-stage submission issue&lt;/h2&gt;

    &lt;p&gt;FDA-centered workflows often run with English-only assumptions. PMDA submissions can make character encoding much more visible, especially when Japanese text appears in supporting material, annotations, comments, or linked documentation.&lt;/p&gt;

    &lt;p&gt;This affects more than just the dataset itself.&lt;/p&gt;

    &lt;ul&gt;
      &lt;li&gt;SAS session encoding&lt;/li&gt;
      &lt;li&gt;XML generation&lt;/li&gt;
      &lt;li&gt;External file exports&lt;/li&gt;
      &lt;li&gt;Stylesheet rendering&lt;/li&gt;
      &lt;li&gt;Round-trip handling between tools&lt;/li&gt;
    &lt;/ul&gt;

    &lt;div class=&quot;studysas-callout&quot;&gt;
      Unicode, typically UTF-8 session encoding, is the safer working setup. Dataset content still needs to stay ASCII-compatible where required.
    &lt;/div&gt;

    &lt;div class=&quot;studysas-warning&quot;&gt;
      Character encoding issues often surface late, during stylesheet rendering or define.xml validation, when they are hardest to fix.
    &lt;/div&gt;
  &lt;/div&gt;

  &lt;div class=&quot;studysas-card&quot;&gt;
    &lt;div class=&quot;studysas-section-label&quot;&gt;Define.xml matters&lt;/div&gt;
    &lt;h2&gt;Define.xml needs to do more than exist&lt;/h2&gt;

    &lt;p&gt;Many teams still treat define.xml as a final publishing step. That is where trouble starts, especially for dual-agency submissions.&lt;/p&gt;

    &lt;p&gt;Define.xml is not just a technical artifact. It is part of what reviewers actually read. If the datasets have moved but your metadata still reflects an earlier state, you are going to create confusion even when the package technically opens.&lt;/p&gt;

    &lt;div class=&quot;studysas-callout&quot;&gt;
      &lt;strong&gt;Define.xml isn’t output decoration. It’s part of the submission.&lt;/strong&gt;
    &lt;/div&gt;

    &lt;p&gt;For PMDA work, another practical point often missed is ARM. PMDA teams often need to think more carefully about Analysis Results Metadata placement and whether the ADaM definition package is telling the reviewer enough without forcing them into extra cross-referencing.&lt;/p&gt;

    &lt;p&gt;If your metadata lags behind your datasets, you are not submission-ready.&lt;/p&gt;
  &lt;/div&gt;

  &lt;div class=&quot;studysas-card&quot;&gt;
    &lt;div class=&quot;studysas-section-label&quot;&gt;SDTM impact&lt;/div&gt;
    &lt;h2&gt;Small differences that show up in datasets&lt;/h2&gt;

    &lt;p&gt;Not every FDA versus PMDA difference is structural. Some show up in day-to-day programming details.&lt;/p&gt;

    &lt;ul&gt;
      &lt;li&gt;&lt;strong&gt;Units:&lt;/strong&gt; teams often need to think about conventional units for FDA-facing expectations versus SI-unit expectations in PMDA-facing work&lt;/li&gt;
      &lt;li&gt;&lt;strong&gt;Reviewer guide issue layout:&lt;/strong&gt; PMDA severity categories change how issue summaries are written&lt;/li&gt;
      &lt;li&gt;&lt;strong&gt;ADaM metadata packaging:&lt;/strong&gt; ARM handling can be more visible in PMDA-oriented builds&lt;/li&gt;
    &lt;/ul&gt;

    &lt;p&gt;These are not always massive coding changes. But they affect how datasets and metadata are interpreted during review.&lt;/p&gt;
  &lt;/div&gt;

  &lt;div class=&quot;studysas-card&quot;&gt;
    &lt;div class=&quot;studysas-section-label&quot;&gt;Side-by-side view&lt;/div&gt;
    &lt;h2&gt;FDA vs PMDA: what actually differs&lt;/h2&gt;

    &lt;div class=&quot;studysas-table-wrap&quot;&gt;
      &lt;table class=&quot;studysas-table&quot;&gt;
        &lt;thead&gt;
          &lt;tr&gt;
            &lt;th&gt;Area&lt;/th&gt;
            &lt;th&gt;FDA&lt;/th&gt;
            &lt;th&gt;PMDA&lt;/th&gt;
          &lt;/tr&gt;
        &lt;/thead&gt;
        &lt;tbody&gt;
          &lt;tr&gt;
            &lt;td&gt;Validation Rules&lt;/td&gt;
            &lt;td&gt;FDA validator rule set&lt;/td&gt;
            &lt;td&gt;PMDA-specific published rule set&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
            &lt;td&gt;Define.xml&lt;/td&gt;
            &lt;td&gt;Expected as part of the submission metadata package&lt;/td&gt;
            &lt;td&gt;Expected with style sheet and checked closely against datasets&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
            &lt;td&gt;Reviewer Guide&lt;/td&gt;
            &lt;td&gt;Expected and important for review context&lt;/td&gt;
            &lt;td&gt;More operational, should document validation setup and findings clearly&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
            &lt;td&gt;Issue Classification&lt;/td&gt;
            &lt;td&gt;Focus on explanation of unresolved issues&lt;/td&gt;
            &lt;td&gt;Severity model matters, Reject/Error/Warning&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
            &lt;td&gt;Submission Risk&lt;/td&gt;
            &lt;td&gt;Findings generally drive questions and clarification&lt;/td&gt;
            &lt;td&gt;Reject findings can block or suspend review until fixed&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
            &lt;td&gt;Rule Version Handling&lt;/td&gt;
            &lt;td&gt;Usually less visible in submission narrative&lt;/td&gt;
            &lt;td&gt;Timing matters because the acceptable engine and rule context can change&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
            &lt;td&gt;Encoding&lt;/td&gt;
            &lt;td&gt;Often English-only in practice&lt;/td&gt;
            &lt;td&gt;Needs more care when non-English content is present&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
            &lt;td&gt;Units&lt;/td&gt;
            &lt;td&gt;Conventional unit expectations more common&lt;/td&gt;
            &lt;td&gt;SI-unit expectations more visible&lt;/td&gt;
          &lt;/tr&gt;
          &lt;tr&gt;
            &lt;td&gt;Validation Scope&lt;/td&gt;
            &lt;td&gt;Datasets and metadata must be reviewable&lt;/td&gt;
            &lt;td&gt;Cross-checks across datasets, metadata, and XML structure matter more visibly&lt;/td&gt;
          &lt;/tr&gt;
        &lt;/tbody&gt;
      &lt;/table&gt;
    &lt;/div&gt;
  &lt;/div&gt;

  &lt;div class=&quot;studysas-card&quot;&gt;
    &lt;div class=&quot;studysas-section-label&quot;&gt;Build model&lt;/div&gt;
    &lt;h2&gt;Recommended workflow&lt;/h2&gt;

    &lt;p&gt;For efficiency, separate at the packaging and documentation layer, not the derivation layer.&lt;/p&gt;

    &lt;ul&gt;
      &lt;li&gt;One SDTM derivation pipeline&lt;/li&gt;
      &lt;li&gt;One controlled metadata source&lt;/li&gt;
      &lt;li&gt;One conformance issue log&lt;/li&gt;
      &lt;li&gt;Agency-specific reviewer guide wording&lt;/li&gt;
      &lt;li&gt;PMDA-specific engine and rule-version tracking&lt;/li&gt;
      &lt;li&gt;Explicit encoding checks&lt;/li&gt;
      &lt;li&gt;Final validation rerun close to submission&lt;/li&gt;
    &lt;/ul&gt;

    &lt;p&gt;This keeps programming unified while allowing submission differences where they actually matter.&lt;/p&gt;
  &lt;/div&gt;

  &lt;div class=&quot;studysas-card&quot;&gt;
    &lt;div class=&quot;studysas-section-label&quot;&gt;Sample cSDRG language&lt;/div&gt;
    &lt;h2&gt;Example cSDRG excerpt for rule-version documentation&lt;/h2&gt;

    &lt;div class=&quot;studysas-code&quot;&gt;Validation Summary:

All SDTM datasets were validated using Pinnacle 21 Enterprise
with the PMDA engine and rule version acceptable at the time of final submission.

Initial validation was performed earlier in study closeout using a prior acceptable engine.
A final rerun was conducted prior to submission to align with the current acceptable engine and rule set.

Any new findings introduced in the final rerun were reviewed and assessed before submission.
Issue details, rationale, and resolution status are documented in Section 6.3.

For FDA-facing review, unresolved issues are explained in the Issue Summary.
For PMDA-facing review, issues are grouped and described using the applicable severity structure.&lt;/div&gt;

    &lt;p class=&quot;studysas-small&quot;&gt;The point is not just to say what was used. The point is to show that the final submission package was checked against the acceptable rule context at the time of submission.&lt;/p&gt;
  &lt;/div&gt;

  &lt;div class=&quot;studysas-card&quot;&gt;
    &lt;div class=&quot;studysas-section-label&quot;&gt;Sample define.xml metadata&lt;/div&gt;
    &lt;h2&gt;Example define.xml snippet showing value-level clarity&lt;/h2&gt;

    &lt;div class=&quot;studysas-code&quot;&gt;&amp;lt;ValueListDef OID=&quot;VL.AE.AESTDTC&quot;&amp;gt;
  &amp;lt;ItemRef ItemOID=&quot;IT.AE.AESTDTC&quot; Mandatory=&quot;Yes&quot;/&amp;gt;
&amp;lt;/ValueListDef&amp;gt;

&amp;lt;WhereClauseDef OID=&quot;WC.AE.PARTIAL&quot;&amp;gt;
  &amp;lt;RangeCheck Comparator=&quot;EQ&quot;&amp;gt;
    &amp;lt;CheckValue&amp;gt;PARTIAL&amp;lt;/CheckValue&amp;gt;
  &amp;lt;/RangeCheck&amp;gt;
&amp;lt;/WhereClauseDef&amp;gt;

&amp;lt;ItemDef OID=&quot;IT.AE.AESTDTC&quot; Name=&quot;AESTDTC&quot; DataType=&quot;text&quot;&amp;gt;
  &amp;lt;Description&amp;gt;
    &amp;lt;TranslatedText&amp;gt;
      Start date of adverse event. Partial dates are imputed to the first day of the month when day is missing.
    &amp;lt;/TranslatedText&amp;gt;
  &amp;lt;/Description&amp;gt;
&amp;lt;/ItemDef&amp;gt;&lt;/div&gt;

    &lt;p&gt;This kind of wording reduces reviewer confusion when partial date handling differs across domains or when imputation rules need to be stated plainly.&lt;/p&gt;
  &lt;/div&gt;

  &lt;div class=&quot;studysas-closing&quot;&gt;
    &lt;h2&gt;Final point&lt;/h2&gt;
    &lt;p&gt;The difference is not really about standards versions alone.&lt;/p&gt;
    &lt;p&gt;It is about submission narration: what you validated, with what, when, under which rule set, and how clearly your metadata explains the data.&lt;/p&gt;
    &lt;p&gt;&lt;strong&gt;PMDA makes these expectations explicit and enforces them through validation outcomes. FDA expects the same clarity, but relies more on explanation and reviewer interpretation.&lt;/strong&gt;&lt;/p&gt;
  &lt;/div&gt;
&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2315822260943695633/posts/default/7326148498474596239'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2315822260943695633/posts/default/7326148498474596239'/><link rel='alternate' type='text/html' href='http://studysas.blogspot.com/2026/03/fda-vs-pmda-submissions-what-really.html' title='FDA vs PMDA Submissions: What Really Changes for SDTM Programmers and Define.xml Teams'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-2315822260943695633.post-2087996360024401701</id><published>2026-03-30T15:44:00.004-04:00</published><updated>2026-04-03T13:32:35.748-04:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="Define.xml Review Checklist"/><title type='text'>A Define.xml Review Checklist I Actually Use Before Submission</title><content type='html'>&lt;!DOCTYPE html&gt;
&lt;html lang=&quot;en&quot;&gt;
&lt;head&gt;
  &lt;meta charset=&quot;UTF-8&quot;&gt;
  &lt;meta name=&quot;viewport&quot; content=&quot;width=device-width, initial-scale=1.0&quot;&gt;
  &lt;title&gt;A Define.xml Review Checklist I Actually Use Before Submission&lt;/title&gt;
&lt;/head&gt;
&lt;body style=&quot;margin:0; padding:0; background:#ffffff; font-family:Arial, Helvetica, sans-serif; color:#1f2937;&quot;&gt;

  &lt;div style=&quot;max-width:1200px; margin:0 auto; background:#ffffff; padding:36px 28px 64px 28px; line-height:1.8; font-size:16px;&quot;&gt;
    &lt;div style=&quot;background:#f1f5f9; border-left:4px solid #153a63; border-radius:10px; padding:14px 20px; margin:0 0 24px 0;&quot;&gt;
      &lt;p&gt;
        An SDTM-focused practical checklist for reviewing define.xml before submission, with emphasis on reproducibility, traceability, consistency, and the reviewer-facing problems that weak metadata creates.
      &lt;/p&gt;
    &lt;/div&gt;
    &lt;p&gt;If you work on SDTM submissions long enough, you learn that define.xml is never just a metadata file.&lt;/p&gt;

    &lt;p&gt;It is the reviewer’s map to the datasets, the controlled terminology, the derivations, the value-level rules, and the awkward corners of the study that never fully fit the standard.&lt;/p&gt;

    &lt;p&gt;Over time, I stopped treating validation as the only sign-off gate. I started using a review checklist that asks a harder question:&lt;/p&gt;

    &lt;div style=&quot;background:#153a63; color:#ffffff; border-radius:14px; padding:20px 22px; margin:18px 0;&quot;&gt;
      If I were a reviewer opening this package for the first time, would I understand the SDTM data without asking the sponsor what they meant?
    &lt;/div&gt;

    &lt;p&gt;A strong define.xml does two jobs at once. It tells the reviewer what is in the submission, and it tells them how to think about it. That is why I review it at three levels: package consistency, metadata accuracy, and reviewer usability. A file can be technically valid and still be weak in one of the other two.&lt;/p&gt;

    &lt;p&gt;What follows is the checklist I actually use before an SDTM submission goes out the door.&lt;/p&gt;

    &lt;!-- Diagram 1 --&gt;
    &lt;div style=&quot;margin:34px 0; border:1px solid #dbe4ee; border-radius:16px; overflow:hidden; background:#fbfdff;&quot;&gt;
      &lt;div style=&quot;padding:16px 20px 8px 20px; font-size:22px; font-weight:bold; color:#153a63;&quot;&gt;Figure 1. Validation clean is not the same as review-ready&lt;/div&gt;
      &lt;div style=&quot;padding:0 20px 18px 20px; color:#64748b; font-size:14px;&quot;&gt;A practical difference between technical conformance and reviewer understanding.&lt;/div&gt;
      &lt;div style=&quot;padding:10px 16px 24px 16px;&quot;&gt;
        &lt;svg viewBox=&quot;0 0 860 290&quot; width=&quot;100%&quot; xmlns=&quot;http://www.w3.org/2000/svg&quot; role=&quot;img&quot; aria-label=&quot;Validation clean versus review ready&quot;&gt;
          &lt;rect x=&quot;30&quot; y=&quot;36&quot; width=&quot;320&quot; height=&quot;190&quot; rx=&quot;18&quot; fill=&quot;#eef4fb&quot; stroke=&quot;#9ab8dc&quot; stroke-width=&quot;2&quot;/&gt;
          &lt;text x=&quot;190&quot; y=&quot;70&quot; text-anchor=&quot;middle&quot; font-size=&quot;24&quot; font-family=&quot;Arial&quot; fill=&quot;#153a63&quot; font-weight=&quot;700&quot;&gt;Validation clean&lt;/text&gt;
          &lt;text x=&quot;60&quot; y=&quot;116&quot; font-size=&quot;18&quot; font-family=&quot;Arial&quot; fill=&quot;#334e68&quot;&gt;• SDTM structure valid&lt;/text&gt;
          &lt;text x=&quot;60&quot; y=&quot;152&quot; font-size=&quot;18&quot; font-family=&quot;Arial&quot; fill=&quot;#334e68&quot;&gt;• Controlled terms valid&lt;/text&gt;
          &lt;text x=&quot;60&quot; y=&quot;188&quot; font-size=&quot;18&quot; font-family=&quot;Arial&quot; fill=&quot;#334e68&quot;&gt;• Basic consistency checked&lt;/text&gt;

          &lt;rect x=&quot;510&quot; y=&quot;36&quot; width=&quot;320&quot; height=&quot;190&quot; rx=&quot;18&quot; fill=&quot;#eef8ef&quot; stroke=&quot;#8bb38d&quot; stroke-width=&quot;2&quot;/&gt;
          &lt;text x=&quot;670&quot; y=&quot;70&quot; text-anchor=&quot;middle&quot; font-size=&quot;24&quot; font-family=&quot;Arial&quot; fill=&quot;#27673a&quot; font-weight=&quot;700&quot;&gt;Review-ready&lt;/text&gt;
          &lt;text x=&quot;540&quot; y=&quot;116&quot; font-size=&quot;18&quot; font-family=&quot;Arial&quot; fill=&quot;#2f5132&quot;&gt;• Logic reproducible&lt;/text&gt;
          &lt;text x=&quot;540&quot; y=&quot;152&quot; font-size=&quot;18&quot; font-family=&quot;Arial&quot; fill=&quot;#2f5132&quot;&gt;• Boundary rules explicit&lt;/text&gt;
          &lt;text x=&quot;540&quot; y=&quot;188&quot; font-size=&quot;18&quot; font-family=&quot;Arial&quot; fill=&quot;#2f5132&quot;&gt;• Reviewer does not need to guess&lt;/text&gt;

          &lt;path d=&quot;M350 130 C425 130, 435 130, 510 130&quot; fill=&quot;none&quot; stroke=&quot;#c2871a&quot; stroke-width=&quot;4&quot;/&gt;
          &lt;polygon points=&quot;510,130 498,123 498,137&quot; fill=&quot;#c2871a&quot;/&gt;
          &lt;text x=&quot;430&quot; y=&quot;110&quot; text-anchor=&quot;middle&quot; font-size=&quot;18&quot; font-family=&quot;Arial&quot; fill=&quot;#9a6710&quot; font-weight=&quot;700&quot;&gt;The real gap&lt;/text&gt;
        &lt;/svg&gt;
      &lt;/div&gt;
    &lt;/div&gt;

    &lt;h2 style=&quot;font-size:28px; color:#153a63; margin:34px 0 12px 0;&quot;&gt;Start with the submission package&lt;/h2&gt;

    &lt;p&gt;Before I even open the XML structure itself, I verify that the dataset package and metadata package agree on the basics.&lt;/p&gt;

    &lt;p&gt;&lt;strong&gt;Checklist&lt;/strong&gt;&lt;/p&gt;

    &lt;ul style=&quot;padding-left:24px;&quot;&gt;
      &lt;li&gt;Confirm every submitted SDTM domain appears in define.xml.&lt;/li&gt;
      &lt;li&gt;Confirm define.xml does not list any domain that is not actually in the submission package.&lt;/li&gt;
      &lt;li&gt;Verify domain names, labels, classes, and structures match the submitted datasets.&lt;/li&gt;
      &lt;li&gt;Check file names, folder placement, and package conventions are consistent.&lt;/li&gt;
      &lt;li&gt;Check the SDRG, aCRF, datasets, and define.xml all point to the same final delivery.&lt;/li&gt;
    &lt;/ul&gt;

    &lt;h2 style=&quot;font-size:28px; color:#153a63; margin:34px 0 12px 0;&quot;&gt;Confirm standards and versions are explicitly identified&lt;/h2&gt;

    &lt;p&gt;This sounds basic, but it is one of the easiest things to leave half-finished when metadata is updated late.&lt;/p&gt;

    &lt;p&gt;&lt;strong&gt;Checklist&lt;/strong&gt;&lt;/p&gt;

    &lt;ul style=&quot;padding-left:24px;&quot;&gt;
      &lt;li&gt;Confirm the SDTMIG version is correctly identified.&lt;/li&gt;
      &lt;li&gt;Confirm the Define-XML version is the one intended for the submission.&lt;/li&gt;
      &lt;li&gt;Confirm controlled terminology versions are named consistently.&lt;/li&gt;
      &lt;li&gt;Confirm external dictionaries such as MedDRA, WHODrug, LOINC, or other study-level standards are versioned consistently across define.xml, SDRG, and study documentation.&lt;/li&gt;
      &lt;li&gt;Confirm no old standard version labels remain from template reuse.&lt;/li&gt;
    &lt;/ul&gt;

    &lt;p&gt;I always treat version signaling as a reviewer orientation issue, not just a metadata housekeeping issue.&lt;/p&gt;

    &lt;h2 style=&quot;font-size:28px; color:#153a63; margin:34px 0 12px 0;&quot;&gt;Check dataset metadata first&lt;/h2&gt;

    &lt;p&gt;The fastest way to spot a weak define.xml is to compare dataset-level metadata against the actual XPT files.&lt;/p&gt;

    &lt;p&gt;&lt;strong&gt;Checklist&lt;/strong&gt;&lt;/p&gt;

    &lt;ul style=&quot;padding-left:24px;&quot;&gt;
      &lt;li&gt;Dataset label matches the dataset’s real purpose and SDTM domain.&lt;/li&gt;
      &lt;li&gt;Class is correct and consistent with SDTM usage.&lt;/li&gt;
      &lt;li&gt;Structure is correct, including whether the domain is one record per subject, one record per event, or another expected pattern.&lt;/li&gt;
      &lt;li&gt;Keys and identifier variables are consistent with the domain content.&lt;/li&gt;
      &lt;li&gt;Dataset-level comments explain anything unusual the reviewer needs to know.&lt;/li&gt;
    &lt;/ul&gt;

    &lt;p&gt;I never trust the metadata spec alone here. I compare the XPT header, define.xml dataset metadata, and the mapping spec line by line for high-risk domains like DM, EX, AE, LB, VS, DS, and SUPP--.&lt;/p&gt;

    &lt;h2 style=&quot;font-size:28px; color:#153a63; margin:34px 0 12px 0;&quot;&gt;Then review variables one by one&lt;/h2&gt;

    &lt;p&gt;This is where most quiet problems live.&lt;/p&gt;

    &lt;p&gt;&lt;strong&gt;Checklist&lt;/strong&gt;&lt;/p&gt;

    &lt;ul style=&quot;padding-left:24px;&quot;&gt;
      &lt;li&gt;Variable name, label, type, length, and format match the actual dataset.&lt;/li&gt;
      &lt;li&gt;Variable order is sensible and consistent with the implementation.&lt;/li&gt;
      &lt;li&gt;Core SDTM variables are present where expected.&lt;/li&gt;
      &lt;li&gt;Required, expected, and permissible usage is justified by the domain.&lt;/li&gt;
      &lt;li&gt;Controlled terminology fields actually point to the right codelist, and the codelist reflects what is used in the data.&lt;/li&gt;
    &lt;/ul&gt;

    &lt;h2 style=&quot;font-size:28px; color:#153a63; margin:34px 0 12px 0;&quot;&gt;1. Can the variable be reproduced from define.xml alone?&lt;/h2&gt;

    &lt;p&gt;This is the first true reviewer test I use.&lt;/p&gt;

    &lt;p&gt;If a reviewer or another programmer had only the SDTM dataset and define.xml, could they recreate the variable safely?&lt;/p&gt;

    &lt;p&gt;If the answer is no, the metadata is not complete enough.&lt;/p&gt;

    &lt;p&gt;&lt;strong&gt;What I check&lt;/strong&gt;&lt;/p&gt;

    &lt;ul style=&quot;padding-left:24px;&quot;&gt;
      &lt;li&gt;Formula is explicitly stated.&lt;/li&gt;
      &lt;li&gt;Anchor variables are named.&lt;/li&gt;
      &lt;li&gt;Selection logic is written, not implied.&lt;/li&gt;
      &lt;li&gt;Units and conversions are visible.&lt;/li&gt;
      &lt;li&gt;The description is specific enough that another programmer could reproduce the result without opening an internal spec.&lt;/li&gt;
    &lt;/ul&gt;

    &lt;div style=&quot;background:#f8fafc; border:1px solid #dbe4ee; border-radius:12px; padding:16px 18px; margin:18px 0;&quot;&gt;
      &lt;strong&gt;Weak&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
      Derived from reference start date.
    &lt;/div&gt;

    &lt;div style=&quot;background:#eef8ef; border:1px solid #bdd9c1; border-radius:12px; padding:16px 18px; margin:18px 0;&quot;&gt;
      &lt;strong&gt;Better&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
      Study day is calculated as Event Date minus DM.RFSTDTC plus 1 when Event Date is on or after DM.RFSTDTC; otherwise Event Date minus DM.RFSTDTC. Records with partial dates are not assigned study day.
    &lt;/div&gt;

    &lt;h2 style=&quot;font-size:28px; color:#153a63; margin:34px 0 12px 0;&quot;&gt;2. Are boundary conditions clearly defined?&lt;/h2&gt;

    &lt;p&gt;Most ambiguity comes from the edges, not the main rule.&lt;/p&gt;

    &lt;p&gt;&lt;strong&gt;What I check&lt;/strong&gt;&lt;/p&gt;

    &lt;ul style=&quot;padding-left:24px;&quot;&gt;
      &lt;li&gt;Same-day records&lt;/li&gt;
      &lt;li&gt;Missing time&lt;/li&gt;
      &lt;li&gt;Partial dates&lt;/li&gt;
      &lt;li&gt;Multiple qualifying records&lt;/li&gt;
      &lt;li&gt;Pre-treatment versus post-treatment boundary&lt;/li&gt;
    &lt;/ul&gt;

    &lt;p&gt;For SDTM Findings flags such as &lt;code style=&quot;font-family:&#39;Courier New&#39;, monospace; background:#eef4fb; padding:2px 6px; border-radius:5px;&quot;&gt;LBLOBXFL&lt;/code&gt;, reviewers usually ask the same things.&lt;/p&gt;

    &lt;ul style=&quot;padding-left:24px;&quot;&gt;
      &lt;li&gt;Is “prior” based on date or datetime?&lt;/li&gt;
      &lt;li&gt;Are same-day records eligible?&lt;/li&gt;
      &lt;li&gt;What if time is missing?&lt;/li&gt;
      &lt;li&gt;How is “last” selected?&lt;/li&gt;
    &lt;/ul&gt;

    &lt;!-- XML Snippet 1 --&gt;
    &lt;div style=&quot;margin:26px 0 12px 0; font-weight:bold; color:#153a63;&quot;&gt;SDTM XML example for LBLOBXFL&lt;/div&gt;
    &lt;div style=&quot;background:#fbfbfc; border:1px solid #d9e2ea; border-radius:12px; overflow:hidden; margin-bottom:22px;&quot;&gt;
      &lt;div style=&quot;background:#edf2f7; padding:10px 14px; font-size:13px; color:#4b5d73; border-bottom:1px solid #d9e2ea;&quot;&gt;Listing 1. Reviewer-friendly method description&lt;/div&gt;
      &lt;pre style=&quot;margin:0; padding:18px; overflow:auto; font-size:14px; line-height:1.6; font-family:&#39;Courier New&#39;, monospace; color:#243447;&quot;&gt;&lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;ItemDef&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;OID=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;IT.LB.LBLOBXFL&quot;&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;Name=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;LBLOBXFL&quot;&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;DataType=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;text&quot;&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;Length=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;1&quot;&lt;/span&gt;&lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;gt;&lt;/span&gt;
  &lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;Description&amp;gt;&lt;/span&gt;
    &lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;TranslatedText&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;xml:lang=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;en&quot;&lt;/span&gt;&lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;gt;&lt;/span&gt;Last Observation Before Exposure Flag&lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;/TranslatedText&amp;gt;&lt;/span&gt;
  &lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;/Description&amp;gt;&lt;/span&gt;
  &lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;Origin&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;Type=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;Derived&quot;&lt;/span&gt;&lt;span style=&quot;color:#1a4f8f;&quot;&gt;/&amp;gt;&lt;/span&gt;
  &lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;MethodRef&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;MethodOID=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;MT.LB.LBLOBXFL&quot;&lt;/span&gt;&lt;span style=&quot;color:#1a4f8f;&quot;&gt;/&amp;gt;&lt;/span&gt;
&lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;/ItemDef&amp;gt;&lt;/span&gt;

&lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;MethodDef&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;OID=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;MT.LB.LBLOBXFL&quot;&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;Name=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;Last Observation Before Exposure Flag Derivation&quot;&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;Type=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;Computation&quot;&lt;/span&gt;&lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;gt;&lt;/span&gt;
  &lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;Description&amp;gt;&lt;/span&gt;
    &lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;TranslatedText&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;xml:lang=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;en&quot;&lt;/span&gt;&lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;gt;&lt;/span&gt;
      LBLOBXFL is assigned as &#39;Y&#39; to the chronologically latest non-missing
      result collected before first exposure. If only dates are available,
      collection date must be strictly earlier than DM.RFSTDTC. Records on
      the first-dose date are eligible only when both collection time and
      dosing time are available and the collection occurs before dosing.
      Records with missing time on the first-dose date are not eligible.
      If multiple qualifying records exist, the latest chronological record
      is selected.
    &lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;/TranslatedText&amp;gt;&lt;/span&gt;
  &lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;/Description&amp;gt;&lt;/span&gt;
&lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;/MethodDef&amp;gt;&lt;/span&gt;&lt;/pre&gt;
    &lt;/div&gt;

    &lt;!-- Diagram 2 --&gt;
    &lt;div style=&quot;margin:34px 0; border:1px solid #dbe4ee; border-radius:16px; overflow:hidden; background:#fbfdff;&quot;&gt;
      &lt;div style=&quot;padding:16px 20px 8px 20px; font-size:22px; font-weight:bold; color:#153a63;&quot;&gt;Figure 2. Boundary cases that should be visible in define.xml&lt;/div&gt;
      &lt;div style=&quot;padding:0 20px 18px 20px; color:#64748b; font-size:14px;&quot;&gt;These are the places where reviewer interpretation usually starts to diverge from team intent.&lt;/div&gt;
      &lt;div style=&quot;padding:10px 16px 24px 16px;&quot;&gt;
        &lt;svg viewBox=&quot;0 0 860 300&quot; width=&quot;100%&quot; xmlns=&quot;http://www.w3.org/2000/svg&quot; role=&quot;img&quot; aria-label=&quot;Boundary case checklist&quot;&gt;
          &lt;rect x=&quot;60&quot; y=&quot;40&quot; width=&quot;180&quot; height=&quot;60&quot; rx=&quot;14&quot; fill=&quot;#eef4fb&quot; stroke=&quot;#9ab8dc&quot; stroke-width=&quot;2&quot;/&gt;
          &lt;text x=&quot;150&quot; y=&quot;77&quot; text-anchor=&quot;middle&quot; font-size=&quot;20&quot; font-family=&quot;Arial&quot; fill=&quot;#153a63&quot; font-weight=&quot;700&quot;&gt;Same-day records&lt;/text&gt;

          &lt;rect x=&quot;340&quot; y=&quot;40&quot; width=&quot;180&quot; height=&quot;60&quot; rx=&quot;14&quot; fill=&quot;#eef4fb&quot; stroke=&quot;#9ab8dc&quot; stroke-width=&quot;2&quot;/&gt;
          &lt;text x=&quot;430&quot; y=&quot;77&quot; text-anchor=&quot;middle&quot; font-size=&quot;20&quot; font-family=&quot;Arial&quot; fill=&quot;#153a63&quot; font-weight=&quot;700&quot;&gt;Missing time&lt;/text&gt;

          &lt;rect x=&quot;620&quot; y=&quot;40&quot; width=&quot;180&quot; height=&quot;60&quot; rx=&quot;14&quot; fill=&quot;#eef4fb&quot; stroke=&quot;#9ab8dc&quot; stroke-width=&quot;2&quot;/&gt;
          &lt;text x=&quot;710&quot; y=&quot;77&quot; text-anchor=&quot;middle&quot; font-size=&quot;20&quot; font-family=&quot;Arial&quot; fill=&quot;#153a63&quot; font-weight=&quot;700&quot;&gt;Partial dates&lt;/text&gt;

          &lt;rect x=&quot;200&quot; y=&quot;170&quot; width=&quot;210&quot; height=&quot;60&quot; rx=&quot;14&quot; fill=&quot;#fff7e8&quot; stroke=&quot;#cfa65a&quot; stroke-width=&quot;2&quot;/&gt;
          &lt;text x=&quot;305&quot; y=&quot;207&quot; text-anchor=&quot;middle&quot; font-size=&quot;20&quot; font-family=&quot;Arial&quot; fill=&quot;#8b6314&quot; font-weight=&quot;700&quot;&gt;Multiple qualifiers&lt;/text&gt;

          &lt;rect x=&quot;470&quot; y=&quot;170&quot; width=&quot;210&quot; height=&quot;60&quot; rx=&quot;14&quot; fill=&quot;#fff7e8&quot; stroke=&quot;#cfa65a&quot; stroke-width=&quot;2&quot;/&gt;
          &lt;text x=&quot;575&quot; y=&quot;207&quot; text-anchor=&quot;middle&quot; font-size=&quot;20&quot; font-family=&quot;Arial&quot; fill=&quot;#8b6314&quot; font-weight=&quot;700&quot;&gt;Date vs datetime anchor&lt;/text&gt;

          &lt;path d=&quot;M150 100 L305 170&quot; stroke=&quot;#7b96b6&quot; stroke-width=&quot;3&quot;/&gt;
          &lt;path d=&quot;M430 100 L430 170&quot; stroke=&quot;#7b96b6&quot; stroke-width=&quot;3&quot;/&gt;
          &lt;path d=&quot;M710 100 L575 170&quot; stroke=&quot;#7b96b6&quot; stroke-width=&quot;3&quot;/&gt;
        &lt;/svg&gt;
      &lt;/div&gt;
    &lt;/div&gt;

    &lt;h2 style=&quot;font-size:28px; color:#153a63; margin:34px 0 12px 0;&quot;&gt;3. Is partial date handling explicitly documented?&lt;/h2&gt;

    &lt;p&gt;Partial date handling is one of the biggest sources of inconsistency across SDTM.&lt;/p&gt;

    &lt;p&gt;Many define.xml files simply say:&lt;/p&gt;

    &lt;div style=&quot;background:#f8fafc; border:1px solid #dbe4ee; border-radius:12px; padding:16px 18px; margin:18px 0;&quot;&gt;
      Partial dates were imputed.
    &lt;/div&gt;

    &lt;p&gt;That does not tell the reviewer enough.&lt;/p&gt;

    &lt;p&gt;&lt;strong&gt;What I check&lt;/strong&gt;&lt;/p&gt;

    &lt;ul style=&quot;padding-left:24px;&quot;&gt;
      &lt;li&gt;Which patterns are imputed&lt;/li&gt;
      &lt;li&gt;What values are assigned&lt;/li&gt;
      &lt;li&gt;Where the imputation is used&lt;/li&gt;
      &lt;li&gt;Whether imputed values are stored in SDTM&lt;/li&gt;
      &lt;li&gt;Whether the logic is consistent across domains&lt;/li&gt;
    &lt;/ul&gt;

    &lt;div style=&quot;background:#eef8ef; border:1px solid #bdd9c1; border-radius:12px; padding:16px 18px; margin:18px 0;&quot;&gt;
      &lt;strong&gt;Better&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
      AE start dates in YYYY-MM format are imputed to the first day of the month for treatment-emergent classification only. Imputed values are not stored in SDTM and are not used for time-to-event analyses.
    &lt;/div&gt;

    &lt;h2 style=&quot;font-size:28px; color:#153a63; margin:34px 0 12px 0;&quot;&gt;4. Is unit standardization clearly described?&lt;/h2&gt;

    &lt;p&gt;For domains such as LB, VS, and EG, this matters more than many teams expect.&lt;/p&gt;

    &lt;p&gt;&lt;strong&gt;What I check&lt;/strong&gt;&lt;/p&gt;

    &lt;ul style=&quot;padding-left:24px;&quot;&gt;
      &lt;li&gt;Whether results are converted&lt;/li&gt;
      &lt;li&gt;What source drives the conversion&lt;/li&gt;
      &lt;li&gt;Whether standardization happens before flag derivation&lt;/li&gt;
      &lt;li&gt;How character results are handled&lt;/li&gt;
    &lt;/ul&gt;

    &lt;div style=&quot;background:#f8fafc; border:1px solid #dbe4ee; border-radius:12px; padding:16px 18px; margin:18px 0;&quot;&gt;
      &lt;strong&gt;Weak&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
      Standard unit.
    &lt;/div&gt;

    &lt;div style=&quot;background:#eef8ef; border:1px solid #bdd9c1; border-radius:12px; padding:16px 18px; margin:18px 0;&quot;&gt;
      &lt;strong&gt;Better&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
      Results for LBTESTCD = ALT are standardized to U/L using approved central lab conversion factors before derivation of LBNRIND. Character results reported as below quantification limit remain in LBSTRESC and do not populate LBSTRESN.
    &lt;/div&gt;

    &lt;!-- XML Snippet 2 --&gt;
    &lt;div style=&quot;margin:26px 0 12px 0; font-weight:bold; color:#153a63;&quot;&gt;Value-level metadata example for lab standardization&lt;/div&gt;
    &lt;div style=&quot;background:#fbfbfc; border:1px solid #d9e2ea; border-radius:12px; overflow:hidden; margin-bottom:22px;&quot;&gt;
      &lt;div style=&quot;background:#edf2f7; padding:10px 14px; font-size:13px; color:#4b5d73; border-bottom:1px solid #d9e2ea;&quot;&gt;Listing 2. Example VLM description pattern&lt;/div&gt;
      &lt;pre style=&quot;margin:0; padding:18px; overflow:auto; font-size:14px; line-height:1.6; font-family:&#39;Courier New&#39;, monospace; color:#243447;&quot;&gt;&lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;WhereClauseDef&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;OID=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;WC.LB.ALT&quot;&lt;/span&gt;&lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;gt;&lt;/span&gt;
  &lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;RangeCheck&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;Comparator=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;EQ&quot;&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;SoftHard=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;Soft&quot;&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;def:ItemOID=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;IT.LB.LBTESTCD&quot;&lt;/span&gt;&lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;gt;&lt;/span&gt;
    &lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;CheckValue&amp;gt;&lt;/span&gt;ALT&lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;/CheckValue&amp;gt;&lt;/span&gt;
  &lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;/RangeCheck&amp;gt;&lt;/span&gt;
&lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;/WhereClauseDef&amp;gt;&lt;/span&gt;

&lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;ItemDef&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;OID=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;IT.LB.LBSTRESN&quot;&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;Name=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;LBSTRESN&quot;&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;DataType=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;float&quot;&lt;/span&gt;&lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;gt;&lt;/span&gt;
  &lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;Description&amp;gt;&lt;/span&gt;
    &lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;TranslatedText&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;xml:lang=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;en&quot;&lt;/span&gt;&lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;gt;&lt;/span&gt;
      For LBTESTCD = ALT, LBSTRESN is standardized to U/L using approved
      central lab conversion factors before derivation of LBNRIND.
    &lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;/TranslatedText&amp;gt;&lt;/span&gt;
  &lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;/Description&amp;gt;&lt;/span&gt;
&lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;/ItemDef&amp;gt;&lt;/span&gt;&lt;/pre&gt;
    &lt;/div&gt;

    &lt;h2 style=&quot;font-size:28px; color:#153a63; margin:34px 0 12px 0;&quot;&gt;Controlled terminology needs reviewer logic&lt;/h2&gt;

    &lt;p&gt;Controlled terminology problems are rarely dramatic, but they are exactly the kind of thing reviewers notice.&lt;/p&gt;

    &lt;p&gt;I do not stop at checking whether a variable points to a codelist. I also check whether the codelist actually explains the values used in the dataset.&lt;/p&gt;

    &lt;p&gt;&lt;strong&gt;Checklist&lt;/strong&gt;&lt;/p&gt;

    &lt;ul style=&quot;padding-left:24px;&quot;&gt;
      &lt;li&gt;Every coded variable points to the correct codelist.&lt;/li&gt;
      &lt;li&gt;Every coded value in the dataset is represented in the linked codelist.&lt;/li&gt;
      &lt;li&gt;Extensible versus non-extensible behavior is handled correctly.&lt;/li&gt;
      &lt;li&gt;“Other” values are used appropriately and not as a catch-all for unresolved mapping.&lt;/li&gt;
      &lt;li&gt;Custom terms are clearly identified, justified, and used only when needed.&lt;/li&gt;
      &lt;li&gt;External terminology references are consistent with the study implementation.&lt;/li&gt;
    &lt;/ul&gt;

    &lt;h2 style=&quot;font-size:28px; color:#153a63; margin:34px 0 12px 0;&quot;&gt;Traceability must make sense&lt;/h2&gt;

    &lt;p&gt;Define.xml is not only about naming things correctly. It is about helping a reviewer understand where data came from and how it was derived.&lt;/p&gt;

    &lt;p&gt;&lt;strong&gt;Checklist&lt;/strong&gt;&lt;/p&gt;

    &lt;ul style=&quot;padding-left:24px;&quot;&gt;
      &lt;li&gt;Origin is correct for each variable, especially collected, derived, and assigned variables.&lt;/li&gt;
      &lt;li&gt;Derivation descriptions are clear, concise, and reproducible.&lt;/li&gt;
      &lt;li&gt;External references, comments, and derivation logic are understandable without reading an internal spec.&lt;/li&gt;
      &lt;li&gt;If something is nonstandard, define.xml and SDRG tell the same story.&lt;/li&gt;
    &lt;/ul&gt;

    &lt;p&gt;This matters most when a variable is derived from multiple sources, when date imputation is involved, or when the domain includes sponsor-specific nuances.&lt;/p&gt;

    &lt;h2 style=&quot;font-size:28px; color:#153a63; margin:34px 0 12px 0;&quot;&gt;Review computational methods as reusable objects&lt;/h2&gt;

    &lt;p&gt;In define.xml, a derivation is not just a sentence. It is a metadata object. If the same logic appears in multiple places, the method references should make that obvious.&lt;/p&gt;

    &lt;p&gt;&lt;strong&gt;Checklist&lt;/strong&gt;&lt;/p&gt;

    &lt;ul style=&quot;padding-left:24px;&quot;&gt;
      &lt;li&gt;Each MethodDef is actually referenced where intended.&lt;/li&gt;
      &lt;li&gt;Duplicated logic is reused rather than described differently in multiple places.&lt;/li&gt;
      &lt;li&gt;Method text is specific enough to reproduce the derivation.&lt;/li&gt;
      &lt;li&gt;Sponsor-defined methods are not described so broadly that they hide record-level conditions.&lt;/li&gt;
      &lt;li&gt;Method naming is understandable to a reviewer and not only to the study team.&lt;/li&gt;
    &lt;/ul&gt;

    &lt;h2 style=&quot;font-size:28px; color:#153a63; margin:34px 0 12px 0;&quot;&gt;5. Are origin and traceability unambiguous?&lt;/h2&gt;

    &lt;p&gt;This is one of the biggest reviewer confidence checks.&lt;/p&gt;

    &lt;p&gt;&lt;strong&gt;What I check&lt;/strong&gt;&lt;/p&gt;

    &lt;ul style=&quot;padding-left:24px;&quot;&gt;
      &lt;li&gt;Is the value CRF-collected, assigned, or derived?&lt;/li&gt;
      &lt;li&gt;Is sponsor mapping logic visible?&lt;/li&gt;
      &lt;li&gt;Does define.xml align with SDRG or cSDRG wording?&lt;/li&gt;
    &lt;/ul&gt;

    &lt;div style=&quot;background:#f8fafc; border:1px solid #dbe4ee; border-radius:12px; padding:16px 18px; margin:18px 0;&quot;&gt;
      &lt;strong&gt;Weak&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
      Relationship to study drug.
    &lt;/div&gt;

    &lt;div style=&quot;background:#eef8ef; border:1px solid #bdd9c1; border-radius:12px; padding:16px 18px; margin:18px 0;&quot;&gt;
      &lt;strong&gt;Better&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
      Collected on AE CRF as investigator assessment of relationship to study treatment. In studies with multiple investigational products, SDTM value represents relationship to primary study treatment as defined in protocol. Sponsor mapping rules are applied when more than one relationship is recorded.
    &lt;/div&gt;

    &lt;!-- Diagram 3 --&gt;
    &lt;div style=&quot;margin:34px 0; border:1px solid #dbe4ee; border-radius:16px; overflow:hidden; background:#fbfdff;&quot;&gt;
      &lt;div style=&quot;padding:16px 20px 8px 20px; font-size:22px; font-weight:bold; color:#153a63;&quot;&gt;Figure 3. Traceability path I expect define.xml to support&lt;/div&gt;
      &lt;div style=&quot;padding:0 20px 18px 20px; color:#64748b; font-size:14px;&quot;&gt;This is the path a reviewer should be able to follow without guessing.&lt;/div&gt;
      &lt;div style=&quot;padding:10px 16px 24px 16px;&quot;&gt;
        &lt;svg viewBox=&quot;0 0 860 240&quot; width=&quot;100%&quot; xmlns=&quot;http://www.w3.org/2000/svg&quot; role=&quot;img&quot; aria-label=&quot;Traceability path&quot;&gt;
          &lt;rect x=&quot;20&quot; y=&quot;84&quot; width=&quot;150&quot; height=&quot;70&quot; rx=&quot;14&quot; fill=&quot;#eef4fb&quot; stroke=&quot;#9ab8dc&quot; stroke-width=&quot;2&quot;/&gt;
          &lt;text x=&quot;95&quot; y=&quot;118&quot; text-anchor=&quot;middle&quot; font-size=&quot;20&quot; font-family=&quot;Arial&quot; fill=&quot;#153a63&quot; font-weight=&quot;700&quot;&gt;CRF&lt;/text&gt;
          &lt;text x=&quot;95&quot; y=&quot;140&quot; text-anchor=&quot;middle&quot; font-size=&quot;14&quot; font-family=&quot;Arial&quot; fill=&quot;#4a6075&quot;&gt;Collected data&lt;/text&gt;

          &lt;rect x=&quot;220&quot; y=&quot;84&quot; width=&quot;150&quot; height=&quot;70&quot; rx=&quot;14&quot; fill=&quot;#eef4fb&quot; stroke=&quot;#9ab8dc&quot; stroke-width=&quot;2&quot;/&gt;
          &lt;text x=&quot;295&quot; y=&quot;118&quot; text-anchor=&quot;middle&quot; font-size=&quot;20&quot; font-family=&quot;Arial&quot; fill=&quot;#153a63&quot; font-weight=&quot;700&quot;&gt;SDTM&lt;/text&gt;
          &lt;text x=&quot;295&quot; y=&quot;140&quot; text-anchor=&quot;middle&quot; font-size=&quot;14&quot; font-family=&quot;Arial&quot; fill=&quot;#4a6075&quot;&gt;Mapped records&lt;/text&gt;

          &lt;rect x=&quot;420&quot; y=&quot;70&quot; width=&quot;180&quot; height=&quot;98&quot; rx=&quot;14&quot; fill=&quot;#eef8ef&quot; stroke=&quot;#8bb38d&quot; stroke-width=&quot;2&quot;/&gt;
          &lt;text x=&quot;510&quot; y=&quot;104&quot; text-anchor=&quot;middle&quot; font-size=&quot;20&quot; font-family=&quot;Arial&quot; fill=&quot;#27673a&quot; font-weight=&quot;700&quot;&gt;define.xml&lt;/text&gt;
          &lt;text x=&quot;510&quot; y=&quot;128&quot; text-anchor=&quot;middle&quot; font-size=&quot;14&quot; font-family=&quot;Arial&quot; fill=&quot;#3a5a3d&quot;&gt;Origin&lt;/text&gt;
          &lt;text x=&quot;510&quot; y=&quot;146&quot; text-anchor=&quot;middle&quot; font-size=&quot;14&quot; font-family=&quot;Arial&quot; fill=&quot;#3a5a3d&quot;&gt;Method&lt;/text&gt;
          &lt;text x=&quot;510&quot; y=&quot;164&quot; text-anchor=&quot;middle&quot; font-size=&quot;14&quot; font-family=&quot;Arial&quot; fill=&quot;#3a5a3d&quot;&gt;VLM&lt;/text&gt;

          &lt;rect x=&quot;650&quot; y=&quot;84&quot; width=&quot;190&quot; height=&quot;70&quot; rx=&quot;14&quot; fill=&quot;#fff7e8&quot; stroke=&quot;#cfa65a&quot; stroke-width=&quot;2&quot;/&gt;
          &lt;text x=&quot;745&quot; y=&quot;118&quot; text-anchor=&quot;middle&quot; font-size=&quot;20&quot; font-family=&quot;Arial&quot; fill=&quot;#8b6314&quot; font-weight=&quot;700&quot;&gt;Reviewer&lt;/text&gt;
          &lt;text x=&quot;745&quot; y=&quot;140&quot; text-anchor=&quot;middle&quot; font-size=&quot;14&quot; font-family=&quot;Arial&quot; fill=&quot;#8b6314&quot;&gt;Interpretation&lt;/text&gt;

          &lt;path d=&quot;M170 119 L220 119&quot; stroke=&quot;#6e8fb4&quot; stroke-width=&quot;4&quot;/&gt;
          &lt;polygon points=&quot;220,119 208,112 208,126&quot; fill=&quot;#6e8fb4&quot;/&gt;

          &lt;path d=&quot;M370 119 L420 119&quot; stroke=&quot;#6e8fb4&quot; stroke-width=&quot;4&quot;/&gt;
          &lt;polygon points=&quot;420,119 408,112 408,126&quot; fill=&quot;#6e8fb4&quot;/&gt;

          &lt;path d=&quot;M600 119 L650 119&quot; stroke=&quot;#c2871a&quot; stroke-width=&quot;4&quot;/&gt;
          &lt;polygon points=&quot;650,119 638,112 638,126&quot; fill=&quot;#c2871a&quot;/&gt;
        &lt;/svg&gt;
      &lt;/div&gt;
    &lt;/div&gt;

    &lt;h2 style=&quot;font-size:28px; color:#153a63; margin:34px 0 12px 0;&quot;&gt;Value-level metadata deserves extra attention&lt;/h2&gt;

    &lt;p&gt;Value-level metadata is often where strong define.xml packages become weak. It is especially important when a variable behaves differently by record type, when metadata changes by subset, or when special derivations need precise explanation.&lt;/p&gt;

    &lt;p&gt;&lt;strong&gt;Checklist&lt;/strong&gt;&lt;/p&gt;

    &lt;ul style=&quot;padding-left:24px;&quot;&gt;
      &lt;li&gt;Value-level metadata is used only when needed and not as a workaround for poor dataset design.&lt;/li&gt;
      &lt;li&gt;The conditions for the value-level metadata are correctly specified.&lt;/li&gt;
      &lt;li&gt;The metadata actually covers all relevant records in the dataset.&lt;/li&gt;
      &lt;li&gt;The resulting description is understandable to a reviewer who is not part of the study team.&lt;/li&gt;
      &lt;li&gt;Each VLM entry adds something useful beyond the parent variable description.&lt;/li&gt;
    &lt;/ul&gt;

    &lt;h2 style=&quot;font-size:28px; color:#153a63; margin:34px 0 12px 0;&quot;&gt;Review SUPP-- carefully&lt;/h2&gt;

    &lt;p&gt;SUPP-- is often technically valid and still a sign that something needs a second look. I always check whether supplemental qualifiers are truly the right implementation, or whether the metadata is compensating for a design decision that deserves more scrutiny.&lt;/p&gt;

    &lt;p&gt;&lt;strong&gt;Checklist&lt;/strong&gt;&lt;/p&gt;

    &lt;ul style=&quot;padding-left:24px;&quot;&gt;
      &lt;li&gt;Each supplemental qualifier is appropriate for SUPP-- use.&lt;/li&gt;
      &lt;li&gt;QNAM, QLABEL, QVAL, IDVAR, and IDVARVAL align with the parent record.&lt;/li&gt;
      &lt;li&gt;Supplemental qualifiers are traceable back to the source collection.&lt;/li&gt;
      &lt;li&gt;Reviewer-facing comments explain any heavy reliance on SUPP--.&lt;/li&gt;
      &lt;li&gt;The same concept is not represented both in a parent domain and in SUPP-- without explanation.&lt;/li&gt;
    &lt;/ul&gt;

    &lt;h2 style=&quot;font-size:28px; color:#153a63; margin:34px 0 12px 0;&quot;&gt;6. Are value-level metadata entries actually useful?&lt;/h2&gt;

    &lt;p&gt;I do not look at VLM just to see whether it exists. I look at whether it adds anything useful.&lt;/p&gt;

    &lt;p&gt;&lt;strong&gt;What I check&lt;/strong&gt;&lt;/p&gt;

    &lt;ul style=&quot;padding-left:24px;&quot;&gt;
      &lt;li&gt;Does each VLM entry add context that the parent variable does not?&lt;/li&gt;
      &lt;li&gt;Are conditions clearly defined?&lt;/li&gt;
      &lt;li&gt;Are methods aligned across subsets?&lt;/li&gt;
      &lt;li&gt;Are units, flags, and derivations consistent with the condition?&lt;/li&gt;
    &lt;/ul&gt;

    &lt;p&gt;If VLM is only repeating variable-level text, it is not doing enough.&lt;/p&gt;

    &lt;h2 style=&quot;font-size:28px; color:#153a63; margin:34px 0 12px 0;&quot;&gt;7. Is the logic consistent across domains?&lt;/h2&gt;

    &lt;p&gt;This is where quiet inconsistency shows up.&lt;/p&gt;

    &lt;p&gt;Typical pattern:&lt;/p&gt;

    &lt;ul style=&quot;padding-left:24px;&quot;&gt;
      &lt;li&gt;AE uses imputed dates&lt;/li&gt;
      &lt;li&gt;LB excludes partial dates&lt;/li&gt;
      &lt;li&gt;VS uses visit date&lt;/li&gt;
      &lt;li&gt;EG uses datetime boundary&lt;/li&gt;
    &lt;/ul&gt;

    &lt;p&gt;Each rule may be valid. But together they may look inconsistent unless the metadata explains where the differences are intentional.&lt;/p&gt;

    &lt;p&gt;&lt;strong&gt;What I check&lt;/strong&gt;&lt;/p&gt;

    &lt;ul style=&quot;padding-left:24px;&quot;&gt;
      &lt;li&gt;Same concept, same logic where possible&lt;/li&gt;
      &lt;li&gt;If not, differences are clearly documented&lt;/li&gt;
    &lt;/ul&gt;

    &lt;h2 style=&quot;font-size:28px; color:#153a63; margin:34px 0 12px 0;&quot;&gt;Hyperlinks and references must work&lt;/h2&gt;

    &lt;p&gt;A broken link in define.xml feels small until it lands in a reviewer’s lap.&lt;/p&gt;

    &lt;p&gt;&lt;strong&gt;Checklist&lt;/strong&gt;&lt;/p&gt;

    &lt;ul style=&quot;padding-left:24px;&quot;&gt;
      &lt;li&gt;All internal references resolve correctly.&lt;/li&gt;
      &lt;li&gt;All external links point to the intended file or metadata object.&lt;/li&gt;
      &lt;li&gt;Links to codelists, origin documents, and external references render correctly in the stylesheet output.&lt;/li&gt;
      &lt;li&gt;The stylesheet displays the metadata in a readable way for human review.&lt;/li&gt;
    &lt;/ul&gt;

    &lt;h2 style=&quot;font-size:28px; color:#153a63; margin:34px 0 12px 0;&quot;&gt;Review define.xml as a reviewer would actually read it&lt;/h2&gt;

    &lt;p&gt;I always open the rendered define.xml in a browser and navigate it as if I were seeing the package for the first time. This catches problems that schema validation does not.&lt;/p&gt;

    &lt;p&gt;&lt;strong&gt;Checklist&lt;/strong&gt;&lt;/p&gt;

    &lt;ul style=&quot;padding-left:24px;&quot;&gt;
      &lt;li&gt;Dataset pages load cleanly.&lt;/li&gt;
      &lt;li&gt;Variable pages are readable and not cluttered with broken references.&lt;/li&gt;
      &lt;li&gt;Value-level metadata is easy to follow in the rendered view.&lt;/li&gt;
      &lt;li&gt;Long method text wraps correctly and is still readable.&lt;/li&gt;
      &lt;li&gt;Codelists, comments, and document links open in a way that helps rather than slows review.&lt;/li&gt;
    &lt;/ul&gt;

    &lt;h2 style=&quot;font-size:28px; color:#153a63; margin:34px 0 12px 0;&quot;&gt;Confirm consistency with SDRG and aCRF&lt;/h2&gt;

    &lt;p&gt;In practice, define.xml, SDRG, aCRF, and datasets should all tell the same story about the SDTM implementation.&lt;/p&gt;

    &lt;p&gt;&lt;strong&gt;Checklist&lt;/strong&gt;&lt;/p&gt;

    &lt;ul style=&quot;padding-left:24px;&quot;&gt;
      &lt;li&gt;Dataset descriptions in define.xml match the SDRG narrative.&lt;/li&gt;
      &lt;li&gt;Deviations from SDTM IG or controlled terminology are explained the same way across documents.&lt;/li&gt;
      &lt;li&gt;aCRF annotations support the variables and origins described in define.xml.&lt;/li&gt;
      &lt;li&gt;Custom domains or special handling are described consistently across the package.&lt;/li&gt;
    &lt;/ul&gt;

    &lt;h2 style=&quot;font-size:28px; color:#153a63; margin:34px 0 12px 0;&quot;&gt;Pay extra attention to custom domains and sponsor-defined variables&lt;/h2&gt;

    &lt;p&gt;Reviewers are usually more tolerant of nonstandard implementation than teams expect, as long as it is explained clearly and consistently. What creates friction is not the existence of a custom rule. It is weak explanation.&lt;/p&gt;

    &lt;p&gt;&lt;strong&gt;Checklist&lt;/strong&gt;&lt;/p&gt;

    &lt;ul style=&quot;padding-left:24px;&quot;&gt;
      &lt;li&gt;Custom domains are clearly identified and justified.&lt;/li&gt;
      &lt;li&gt;Sponsor-defined variables do not look like standard variables by accident.&lt;/li&gt;
      &lt;li&gt;Naming, labels, origins, and methods are aligned across define.xml and SDRG.&lt;/li&gt;
      &lt;li&gt;Reviewer-facing explanations describe why the implementation was needed, not only what was done.&lt;/li&gt;
    &lt;/ul&gt;

    &lt;h2 style=&quot;font-size:28px; color:#153a63; margin:34px 0 12px 0;&quot;&gt;8. Does the metadata match the actual SDTM data?&lt;/h2&gt;

    &lt;p&gt;This sounds obvious, but it fails more often than it should.&lt;/p&gt;

    &lt;p&gt;&lt;strong&gt;What I check&lt;/strong&gt;&lt;/p&gt;

    &lt;ul style=&quot;padding-left:24px;&quot;&gt;
      &lt;li&gt;Derivation wording matches observed values.&lt;/li&gt;
      &lt;li&gt;Units match actual standardized data.&lt;/li&gt;
      &lt;li&gt;Flags behave the way the method says they do.&lt;/li&gt;
      &lt;li&gt;No leftover template language remains.&lt;/li&gt;
    &lt;/ul&gt;

    &lt;p&gt;A common example is when metadata says “latest value prior to treatment,” but same-day post-dose records are still flagged. That is not a programming issue anymore. That is a metadata credibility issue.&lt;/p&gt;

    &lt;h2 style=&quot;font-size:28px; color:#153a63; margin:34px 0 12px 0;&quot;&gt;Run validation, but do not stop there&lt;/h2&gt;

    &lt;p&gt;Validation catches structural problems. It does not catch every reviewer-facing problem.&lt;/p&gt;

    &lt;p&gt;&lt;strong&gt;Checklist&lt;/strong&gt;&lt;/p&gt;

    &lt;ul style=&quot;padding-left:24px;&quot;&gt;
      &lt;li&gt;XML validates against the intended Define-XML schema.&lt;/li&gt;
      &lt;li&gt;No broken references or unresolved metadata objects remain.&lt;/li&gt;
      &lt;li&gt;No obvious conformance errors remain after tool-based validation.&lt;/li&gt;
      &lt;li&gt;Manual review confirms the metadata still makes reviewer sense after the final dataset freeze.&lt;/li&gt;
      &lt;li&gt;The stylesheet output is readable and matches the intended submission package.&lt;/li&gt;
    &lt;/ul&gt;

    &lt;h2 style=&quot;font-size:28px; color:#153a63; margin:34px 0 12px 0;&quot;&gt;9. Are ambiguous phrases eliminated?&lt;/h2&gt;

    &lt;p&gt;These phrases are common, but they usually create more uncertainty than they remove:&lt;/p&gt;

    &lt;ul style=&quot;padding-left:24px;&quot;&gt;
      &lt;li&gt;Derived from reference start date&lt;/li&gt;
      &lt;li&gt;Last non-missing value&lt;/li&gt;
      &lt;li&gt;Standard unit&lt;/li&gt;
      &lt;li&gt;Partial dates were imputed&lt;/li&gt;
      &lt;li&gt;Relationship to study drug&lt;/li&gt;
    &lt;/ul&gt;

    &lt;p&gt;Each one hides decisions. The fix is not more words for the sake of more words. The fix is writing the actual rule.&lt;/p&gt;

    &lt;h2 style=&quot;font-size:28px; color:#153a63; margin:34px 0 12px 0;&quot;&gt;Recheck after the final metadata refresh&lt;/h2&gt;

    &lt;p&gt;One of the most common late-stage problems is not a wrong derivation. It is a right derivation described by the wrong final metadata because the datasets, SDRG, aCRF, and define.xml did not freeze in the same rhythm.&lt;/p&gt;

    &lt;p&gt;&lt;strong&gt;Checklist&lt;/strong&gt;&lt;/p&gt;

    &lt;ul style=&quot;padding-left:24px;&quot;&gt;
      &lt;li&gt;Final XPT files match the last define.xml build.&lt;/li&gt;
      &lt;li&gt;No late updates were made to one document but not the others.&lt;/li&gt;
      &lt;li&gt;Reviewer comments, methods, and links still point to the final objects.&lt;/li&gt;
      &lt;li&gt;The rendered define.xml reflects the actual submission package, not the pre-freeze draft.&lt;/li&gt;
    &lt;/ul&gt;

    &lt;!-- Diagram 4 --&gt;
    &lt;div style=&quot;margin:34px 0; border:1px solid #dbe4ee; border-radius:16px; overflow:hidden; background:#fbfdff;&quot;&gt;
      &lt;div style=&quot;padding:16px 20px 8px 20px; font-size:22px; font-weight:bold; color:#153a63;&quot;&gt;Figure 4. The checklist I use before sign-off&lt;/div&gt;
      &lt;div style=&quot;padding:0 20px 18px 20px; color:#64748b; font-size:14px;&quot;&gt;A simple pre-submission pass that catches most weak-metadata problems.&lt;/div&gt;
      &lt;div style=&quot;padding:10px 16px 24px 16px;&quot;&gt;
        &lt;svg viewBox=&quot;0 0 860 360&quot; width=&quot;100%&quot; xmlns=&quot;http://www.w3.org/2000/svg&quot; role=&quot;img&quot; aria-label=&quot;Define.xml checklist&quot;&gt;
          &lt;rect x=&quot;320&quot; y=&quot;16&quot; width=&quot;220&quot; height=&quot;50&quot; rx=&quot;12&quot; fill=&quot;#153a63&quot;/&gt;
          &lt;text x=&quot;430&quot; y=&quot;48&quot; text-anchor=&quot;middle&quot; font-size=&quot;22&quot; font-family=&quot;Arial&quot; fill=&quot;#ffffff&quot; font-weight=&quot;700&quot;&gt;define.xml review&lt;/text&gt;

          &lt;rect x=&quot;40&quot; y=&quot;108&quot; width=&quot;170&quot; height=&quot;62&quot; rx=&quot;14&quot; fill=&quot;#eef4fb&quot; stroke=&quot;#9ab8dc&quot; stroke-width=&quot;2&quot;/&gt;
          &lt;text x=&quot;125&quot; y=&quot;135&quot; text-anchor=&quot;middle&quot; font-size=&quot;18&quot; font-family=&quot;Arial&quot; fill=&quot;#153a63&quot; font-weight=&quot;700&quot;&gt;Reproducibility&lt;/text&gt;
          &lt;text x=&quot;125&quot; y=&quot;157&quot; text-anchor=&quot;middle&quot; font-size=&quot;14&quot; font-family=&quot;Arial&quot; fill=&quot;#4a6075&quot;&gt;Can it be rebuilt?&lt;/text&gt;

          &lt;rect x=&quot;245&quot; y=&quot;108&quot; width=&quot;170&quot; height=&quot;62&quot; rx=&quot;14&quot; fill=&quot;#eef4fb&quot; stroke=&quot;#9ab8dc&quot; stroke-width=&quot;2&quot;/&gt;
          &lt;text x=&quot;330&quot; y=&quot;135&quot; text-anchor=&quot;middle&quot; font-size=&quot;18&quot; font-family=&quot;Arial&quot; fill=&quot;#153a63&quot; font-weight=&quot;700&quot;&gt;Boundaries&lt;/text&gt;
          &lt;text x=&quot;330&quot; y=&quot;157&quot; text-anchor=&quot;middle&quot; font-size=&quot;14&quot; font-family=&quot;Arial&quot; fill=&quot;#4a6075&quot;&gt;Are edge cases clear?&lt;/text&gt;

          &lt;rect x=&quot;450&quot; y=&quot;108&quot; width=&quot;170&quot; height=&quot;62&quot; rx=&quot;14&quot; fill=&quot;#eef4fb&quot; stroke=&quot;#9ab8dc&quot; stroke-width=&quot;2&quot;/&gt;
          &lt;text x=&quot;535&quot; y=&quot;135&quot; text-anchor=&quot;middle&quot; font-size=&quot;18&quot; font-family=&quot;Arial&quot; fill=&quot;#153a63&quot; font-weight=&quot;700&quot;&gt;Traceability&lt;/text&gt;
          &lt;text x=&quot;535&quot; y=&quot;157&quot; text-anchor=&quot;middle&quot; font-size=&quot;14&quot; font-family=&quot;Arial&quot; fill=&quot;#4a6075&quot;&gt;Origin and method clear?&lt;/text&gt;

          &lt;rect x=&quot;655&quot; y=&quot;108&quot; width=&quot;170&quot; height=&quot;62&quot; rx=&quot;14&quot; fill=&quot;#eef4fb&quot; stroke=&quot;#9ab8dc&quot; stroke-width=&quot;2&quot;/&gt;
          &lt;text x=&quot;740&quot; y=&quot;135&quot; text-anchor=&quot;middle&quot; font-size=&quot;18&quot; font-family=&quot;Arial&quot; fill=&quot;#153a63&quot; font-weight=&quot;700&quot;&gt;Consistency&lt;/text&gt;
          &lt;text x=&quot;740&quot; y=&quot;157&quot; text-anchor=&quot;middle&quot; font-size=&quot;14&quot; font-family=&quot;Arial&quot; fill=&quot;#4a6075&quot;&gt;Across domains?&lt;/text&gt;

          &lt;rect x=&quot;245&quot; y=&quot;242&quot; width=&quot;170&quot; height=&quot;62&quot; rx=&quot;14&quot; fill=&quot;#fff7e8&quot; stroke=&quot;#cfa65a&quot; stroke-width=&quot;2&quot;/&gt;
          &lt;text x=&quot;330&quot; y=&quot;269&quot; text-anchor=&quot;middle&quot; font-size=&quot;18&quot; font-family=&quot;Arial&quot; fill=&quot;#8b6314&quot; font-weight=&quot;700&quot;&gt;VLM quality&lt;/text&gt;
          &lt;text x=&quot;330&quot; y=&quot;291&quot; text-anchor=&quot;middle&quot; font-size=&quot;14&quot; font-family=&quot;Arial&quot; fill=&quot;#8b6314&quot;&gt;Adds useful context?&lt;/text&gt;

          &lt;rect x=&quot;450&quot; y=&quot;242&quot; width=&quot;170&quot; height=&quot;62&quot; rx=&quot;14&quot; fill=&quot;#fff7e8&quot; stroke=&quot;#cfa65a&quot; stroke-width=&quot;2&quot;/&gt;
          &lt;text x=&quot;535&quot; y=&quot;269&quot; text-anchor=&quot;middle&quot; font-size=&quot;18&quot; font-family=&quot;Arial&quot; fill=&quot;#8b6314&quot; font-weight=&quot;700&quot;&gt;No ambiguity&lt;/text&gt;
          &lt;text x=&quot;535&quot; y=&quot;291&quot; text-anchor=&quot;middle&quot; font-size=&quot;14&quot; font-family=&quot;Arial&quot; fill=&quot;#8b6314&quot;&gt;Would reviewer ask?&lt;/text&gt;

          &lt;path d=&quot;M430 66 L125 108&quot; stroke=&quot;#7b96b6&quot; stroke-width=&quot;3&quot;/&gt;
          &lt;path d=&quot;M430 66 L330 108&quot; stroke=&quot;#7b96b6&quot; stroke-width=&quot;3&quot;/&gt;
          &lt;path d=&quot;M430 66 L535 108&quot; stroke=&quot;#7b96b6&quot; stroke-width=&quot;3&quot;/&gt;
          &lt;path d=&quot;M430 66 L740 108&quot; stroke=&quot;#7b96b6&quot; stroke-width=&quot;3&quot;/&gt;
          &lt;path d=&quot;M330 170 L330 242&quot; stroke=&quot;#c2871a&quot; stroke-width=&quot;3&quot;/&gt;
          &lt;path d=&quot;M535 170 L535 242&quot; stroke=&quot;#c2871a&quot; stroke-width=&quot;3&quot;/&gt;
        &lt;/svg&gt;
      &lt;/div&gt;
    &lt;/div&gt;

    &lt;h2 style=&quot;font-size:28px; color:#153a63; margin:34px 0 12px 0;&quot;&gt;10. Final test: will this create a reviewer question?&lt;/h2&gt;

    &lt;p&gt;This is the last question I ask.&lt;/p&gt;

    &lt;div style=&quot;background:#153a63; color:#ffffff; border-radius:14px; padding:20px 22px; margin:18px 0;&quot;&gt;
      Can a reviewer understand this rule without asking what we meant?
    &lt;/div&gt;

    &lt;p&gt;If the answer is no, the metadata is still too thin.&lt;/p&gt;

    &lt;h2 style=&quot;font-size:28px; color:#153a63; margin:34px 0 12px 0;&quot;&gt;Final thought&lt;/h2&gt;

    &lt;p&gt;The define.xml packages that cause the fewest review problems are usually not the ones with the fanciest tooling.&lt;/p&gt;

    &lt;p&gt;They are the ones where the metadata, datasets, SDRG, and annotated CRF tell the same story without forcing the reviewer to fill in the gaps.&lt;/p&gt;

    &lt;p&gt;That is the standard I use before submission.&lt;/p&gt;

    &lt;div style=&quot;margin-top:36px; padding:18px 20px; border-radius:14px; background:#0f172a; color:#ffffff;&quot;&gt;
      &lt;strong&gt;Suggested closing question for comments&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
      Which define.xml checkpoint catches the most problems in your SDTM submissions: reproducibility, traceability, standards/version control, or boundary handling?
    &lt;/div&gt;

  &lt;/div&gt;
&lt;/body&gt;
&lt;/html&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2315822260943695633/posts/default/2087996360024401701'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2315822260943695633/posts/default/2087996360024401701'/><link rel='alternate' type='text/html' href='http://studysas.blogspot.com/2026/03/a-definexml-review-checklist-i-actually.html' title='A Define.xml Review Checklist I Actually Use Before Submission'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-2315822260943695633.post-7707934396719644475</id><published>2026-03-30T13:10:00.001-04:00</published><updated>2026-03-30T13:11:02.422-04:00</updated><title type='text'>Five Define.xml Phrases That Sound Fine, But Trigger Review Questions</title><content type='html'>&lt;!DOCTYPE html&gt;
&lt;html lang=&quot;en&quot;&gt;
&lt;head&gt;
  &lt;meta charset=&quot;UTF-8&quot;&gt;
  &lt;meta name=&quot;viewport&quot; content=&quot;width=device-width, initial-scale=1.0&quot;&gt;
  &lt;title&gt;Five Define.xml Phrases That Sound Fine, But Trigger Review Questions&lt;/title&gt;
&lt;/head&gt;
&lt;body style=&quot;margin:0; padding:0; background:#f3f6fa; font-family:Arial, Helvetica, sans-serif; color:#1f2937;&quot;&gt;

  &lt;div style=&quot;max-width:900px; margin:0 auto; background:#ffffff; padding:36px 28px 64px 28px; line-height:1.75; font-size:16px;&quot;&gt;

    &lt;!-- Hero --&gt;
    &lt;div style=&quot;background:linear-gradient(135deg,#153a63,#2b68b0); color:#ffffff; padding:34px 30px; border-radius:16px; margin-bottom:30px;&quot;&gt;
      &lt;div style=&quot;font-size:12px; letter-spacing:1px; text-transform:uppercase; opacity:0.9; margin-bottom:8px;&quot;&gt;StudySAS Blog&lt;/div&gt;
      &lt;h1 style=&quot;margin:0 0 10px 0; font-size:34px; line-height:1.2;&quot;&gt;Five Define.xml Phrases That Sound Fine, But Trigger Review Questions&lt;/h1&gt;
      &lt;p style=&quot;margin:0; font-size:18px; line-height:1.6; max-width:760px;&quot;&gt;
        A practical look at the wording patterns that pass internal review, validate cleanly, and still create trouble when a reviewer tries to understand your SDTM logic from metadata alone.
      &lt;/p&gt;
    &lt;/div&gt;

    &lt;p&gt;Some define.xml wording looks perfectly acceptable during internal review.&lt;/p&gt;

    &lt;p&gt;Then the same wording creates questions during submission review.&lt;/p&gt;

    &lt;p&gt;Not because the data is wrong. Not because the programming is broken. But because the description leaves too much room for interpretation.&lt;/p&gt;

    &lt;p&gt;That gap matters more than many teams realize. Define.xml is the reviewer’s first structured view of your SDTM package. If the metadata is thin, the reviewer starts guessing. And once guessing starts, questions follow.&lt;/p&gt;

    &lt;div style=&quot;background:#fff7e9; border-left:6px solid #d08d21; padding:18px 18px 18px 20px; margin:28px 0; border-radius:10px;&quot;&gt;
      &lt;strong&gt;One useful standard&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
      A good define.xml description should let another programmer, or a reviewer, understand the rule without relying on team memory.
    &lt;/div&gt;

    &lt;!-- Visual 1 --&gt;
    &lt;div style=&quot;margin:34px 0; border:1px solid #d7e1ee; border-radius:16px; overflow:hidden; background:#fbfdff;&quot;&gt;
      &lt;div style=&quot;padding:16px 20px 8px 20px; font-size:22px; font-weight:bold; color:#153a63;&quot;&gt;Figure 1. Why these phrases fail in review&lt;/div&gt;
      &lt;div style=&quot;padding:0 20px 20px 20px; color:#5b6878; font-size:14px;&quot;&gt;The problem is rarely the variable itself. The problem is the space between what the team knows and what the metadata actually says.&lt;/div&gt;
      &lt;div style=&quot;padding:8px 14px 24px 14px;&quot;&gt;
        &lt;svg viewBox=&quot;0 0 860 310&quot; width=&quot;100%&quot; xmlns=&quot;http://www.w3.org/2000/svg&quot; role=&quot;img&quot; aria-label=&quot;Why weak phrases trigger review questions&quot;&gt;
          &lt;defs&gt;
            &lt;linearGradient id=&quot;bluebox2&quot; x1=&quot;0%&quot; y1=&quot;0%&quot; x2=&quot;100%&quot; y2=&quot;100%&quot;&gt;
              &lt;stop offset=&quot;0%&quot; stop-color=&quot;#edf4fb&quot;/&gt;
              &lt;stop offset=&quot;100%&quot; stop-color=&quot;#d9e8f8&quot;/&gt;
            &lt;/linearGradient&gt;
            &lt;linearGradient id=&quot;goldbox2&quot; x1=&quot;0%&quot; y1=&quot;0%&quot; x2=&quot;100%&quot; y2=&quot;100%&quot;&gt;
              &lt;stop offset=&quot;0%&quot; stop-color=&quot;#fff7e8&quot;/&gt;
              &lt;stop offset=&quot;100%&quot; stop-color=&quot;#f9ebc8&quot;/&gt;
            &lt;/linearGradient&gt;
            &lt;linearGradient id=&quot;redbox2&quot; x1=&quot;0%&quot; y1=&quot;0%&quot; x2=&quot;100%&quot; y2=&quot;100%&quot;&gt;
              &lt;stop offset=&quot;0%&quot; stop-color=&quot;#fff1f1&quot;/&gt;
              &lt;stop offset=&quot;100%&quot; stop-color=&quot;#f6dcdc&quot;/&gt;
            &lt;/linearGradient&gt;
          &lt;/defs&gt;

          &lt;rect x=&quot;20&quot; y=&quot;40&quot; width=&quot;230&quot; height=&quot;200&quot; rx=&quot;18&quot; fill=&quot;url(#bluebox2)&quot; stroke=&quot;#90b3da&quot; stroke-width=&quot;2&quot;/&gt;
          &lt;text x=&quot;135&quot; y=&quot;72&quot; text-anchor=&quot;middle&quot; font-size=&quot;24&quot; font-family=&quot;Arial&quot; fill=&quot;#153a63&quot; font-weight=&quot;700&quot;&gt;Team intent&lt;/text&gt;
          &lt;text x=&quot;135&quot; y=&quot;118&quot; text-anchor=&quot;middle&quot; font-size=&quot;17&quot; font-family=&quot;Arial&quot; fill=&quot;#3d5875&quot;&gt;Rule exists&lt;/text&gt;
          &lt;text x=&quot;135&quot; y=&quot;148&quot; text-anchor=&quot;middle&quot; font-size=&quot;17&quot; font-family=&quot;Arial&quot; fill=&quot;#3d5875&quot;&gt;Programming is correct&lt;/text&gt;
          &lt;text x=&quot;135&quot; y=&quot;178&quot; text-anchor=&quot;middle&quot; font-size=&quot;17&quot; font-family=&quot;Arial&quot; fill=&quot;#3d5875&quot;&gt;QC is complete&lt;/text&gt;

          &lt;rect x=&quot;315&quot; y=&quot;40&quot; width=&quot;230&quot; height=&quot;200&quot; rx=&quot;18&quot; fill=&quot;url(#goldbox2)&quot; stroke=&quot;#caa45a&quot; stroke-width=&quot;2&quot;/&gt;
          &lt;text x=&quot;430&quot; y=&quot;72&quot; text-anchor=&quot;middle&quot; font-size=&quot;24&quot; font-family=&quot;Arial&quot; fill=&quot;#8c6616&quot; font-weight=&quot;700&quot;&gt;Metadata wording&lt;/text&gt;
          &lt;text x=&quot;430&quot; y=&quot;118&quot; text-anchor=&quot;middle&quot; font-size=&quot;17&quot; font-family=&quot;Arial&quot; fill=&quot;#7d5b18&quot;&gt;Too short&lt;/text&gt;
          &lt;text x=&quot;430&quot; y=&quot;148&quot; text-anchor=&quot;middle&quot; font-size=&quot;17&quot; font-family=&quot;Arial&quot; fill=&quot;#7d5b18&quot;&gt;Too broad&lt;/text&gt;
          &lt;text x=&quot;430&quot; y=&quot;178&quot; text-anchor=&quot;middle&quot; font-size=&quot;17&quot; font-family=&quot;Arial&quot; fill=&quot;#7d5b18&quot;&gt;Edge cases missing&lt;/text&gt;

          &lt;rect x=&quot;610&quot; y=&quot;40&quot; width=&quot;230&quot; height=&quot;200&quot; rx=&quot;18&quot; fill=&quot;url(#redbox2)&quot; stroke=&quot;#d69797&quot; stroke-width=&quot;2&quot;/&gt;
          &lt;text x=&quot;725&quot; y=&quot;72&quot; text-anchor=&quot;middle&quot; font-size=&quot;24&quot; font-family=&quot;Arial&quot; fill=&quot;#8f2e2e&quot; font-weight=&quot;700&quot;&gt;Review result&lt;/text&gt;
          &lt;text x=&quot;725&quot; y=&quot;118&quot; text-anchor=&quot;middle&quot; font-size=&quot;17&quot; font-family=&quot;Arial&quot; fill=&quot;#6a3535&quot;&gt;Reviewer guesses&lt;/text&gt;
          &lt;text x=&quot;725&quot; y=&quot;148&quot; text-anchor=&quot;middle&quot; font-size=&quot;17&quot; font-family=&quot;Arial&quot; fill=&quot;#6a3535&quot;&gt;Questions increase&lt;/text&gt;
          &lt;text x=&quot;725&quot; y=&quot;178&quot; text-anchor=&quot;middle&quot; font-size=&quot;17&quot; font-family=&quot;Arial&quot; fill=&quot;#6a3535&quot;&gt;Trust drops&lt;/text&gt;

          &lt;path d=&quot;M250 140 L315 140&quot; stroke=&quot;#6a8eb8&quot; stroke-width=&quot;4&quot;/&gt;
          &lt;polygon points=&quot;315,140 303,133 303,147&quot; fill=&quot;#6a8eb8&quot;/&gt;

          &lt;path d=&quot;M545 140 L610 140&quot; stroke=&quot;#c68a1f&quot; stroke-width=&quot;4&quot;/&gt;
          &lt;polygon points=&quot;610,140 598,133 598,147&quot; fill=&quot;#c68a1f&quot;/&gt;
        &lt;/svg&gt;
      &lt;/div&gt;
    &lt;/div&gt;

    &lt;h2 style=&quot;font-size:28px; color:#153a63; margin:34px 0 12px 0;&quot;&gt;1. “Derived from reference start date”&lt;/h2&gt;

    &lt;p&gt;This is common for &lt;code style=&quot;font-family:&#39;Courier New&#39;, monospace; background:#eef4fb; padding:2px 5px; border-radius:5px;&quot;&gt;--DY&lt;/code&gt; variables.&lt;/p&gt;

    &lt;p&gt;It sounds reasonable. But it does not tell the reviewer enough to recreate the rule safely.&lt;/p&gt;

    &lt;p&gt;What is missing:&lt;/p&gt;

    &lt;ul style=&quot;padding-left:24px;&quot;&gt;
      &lt;li&gt;The actual formula&lt;/li&gt;
      &lt;li&gt;The day 1 boundary rule&lt;/li&gt;
      &lt;li&gt;How pre-treatment values are handled&lt;/li&gt;
      &lt;li&gt;What happens with partial dates&lt;/li&gt;
      &lt;li&gt;Whether the logic is date-based or datetime-based&lt;/li&gt;
    &lt;/ul&gt;

    &lt;div style=&quot;background:#f8fafc; border:1px solid #d7dee8; border-radius:12px; padding:16px 18px; margin:18px 0;&quot;&gt;
      &lt;strong&gt;Weak&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
      Derived from reference start date.
    &lt;/div&gt;

    &lt;div style=&quot;background:#eef8ef; border:1px solid #bbd9bf; border-radius:12px; padding:16px 18px; margin:18px 0;&quot;&gt;
      &lt;strong&gt;Better&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
      Study day is calculated as Event Date minus DM.RFSTDTC plus 1 when Event Date is on or after DM.RFSTDTC; otherwise Event Date minus DM.RFSTDTC. No date imputation is applied for this derivation. Records with partial event dates are not assigned study day.
    &lt;/div&gt;

    &lt;p&gt;The second version does real work. It tells the reviewer what the formula is, where the boundary sits, and what is excluded.&lt;/p&gt;

    &lt;h2 style=&quot;font-size:28px; color:#153a63; margin:34px 0 12px 0;&quot;&gt;2. “Last non-missing value prior to treatment”&lt;/h2&gt;

    &lt;p&gt;This is one of the most common metadata phrases in findings logic, especially when teams derive an SDTM flag such as &lt;code style=&quot;font-family:&#39;Courier New&#39;, monospace; background:#eef4fb; padding:2px 5px; border-radius:5px;&quot;&gt;LBLOBXFL&lt;/code&gt;.&lt;/p&gt;

    &lt;p&gt;The wording sounds precise. It is not.&lt;/p&gt;

    &lt;p&gt;It leaves open several questions:&lt;/p&gt;

    &lt;ul style=&quot;padding-left:24px;&quot;&gt;
      &lt;li&gt;What defines “prior”: the reference start date or the actual exposure datetime?&lt;/li&gt;
      &lt;li&gt;What happens for records on the first-dose date?&lt;/li&gt;
      &lt;li&gt;What if collection time is missing?&lt;/li&gt;
      &lt;li&gt;Are unscheduled visits included?&lt;/li&gt;
      &lt;li&gt;If more than one record qualifies, how is “last” decided?&lt;/li&gt;
    &lt;/ul&gt;

    &lt;!-- XML Example --&gt;
    &lt;div style=&quot;margin:26px 0 12px 0; font-weight:bold; color:#153a63;&quot;&gt;SDTM XML example, weak vs better&lt;/div&gt;

    &lt;div style=&quot;background:#fbfbfc; border:1px solid #d9dee7; border-radius:12px; overflow:hidden; margin-bottom:22px;&quot;&gt;
      &lt;div style=&quot;background:#e9eef5; padding:10px 14px; font-size:13px; color:#4d6078; border-bottom:1px solid #d9dee7;&quot;&gt;Listing 1. Minimal method description for LBLOBXFL&lt;/div&gt;
      &lt;pre style=&quot;margin:0; padding:18px 18px; overflow:auto; font-size:14px; line-height:1.6; font-family:&#39;Courier New&#39;, monospace; color:#243447;&quot;&gt;&lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;ItemDef&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;OID=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;IT.LB.LBLOBXFL&quot;&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;Name=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;LBLOBXFL&quot;&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;DataType=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;text&quot;&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;Length=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;1&quot;&lt;/span&gt;&lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;gt;&lt;/span&gt;
  &lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;Description&amp;gt;&lt;/span&gt;
    &lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;TranslatedText&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;xml:lang=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;en&quot;&lt;/span&gt;&lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;gt;&lt;/span&gt;Last Observation Before Exposure Flag&lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;/TranslatedText&amp;gt;&lt;/span&gt;
  &lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;/Description&amp;gt;&lt;/span&gt;
  &lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;Origin&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;Type=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;Derived&quot;&lt;/span&gt;&lt;span style=&quot;color:#1a4f8f;&quot;&gt;/&amp;gt;&lt;/span&gt;
  &lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;MethodRef&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;MethodOID=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;MT.LB.LBLOBXFL&quot;&lt;/span&gt;&lt;span style=&quot;color:#1a4f8f;&quot;&gt;/&amp;gt;&lt;/span&gt;
&lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;/ItemDef&amp;gt;&lt;/span&gt;

&lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;MethodDef&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;OID=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;MT.LB.LBLOBXFL&quot;&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;Name=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;Last Observation Before Exposure Flag&quot;&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;Type=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;Computation&quot;&lt;/span&gt;&lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;gt;&lt;/span&gt;
  &lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;Description&amp;gt;&lt;/span&gt;
    &lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;TranslatedText&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;xml:lang=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;en&quot;&lt;/span&gt;&lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;gt;&lt;/span&gt;
      Last non-missing result prior to treatment.
    &lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;/TranslatedText&amp;gt;&lt;/span&gt;
  &lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;/Description&amp;gt;&lt;/span&gt;
&lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;/MethodDef&amp;gt;&lt;/span&gt;&lt;/pre&gt;
    &lt;/div&gt;

    &lt;div style=&quot;background:#fbfbfc; border:1px solid #d9dee7; border-radius:12px; overflow:hidden; margin-bottom:24px;&quot;&gt;
      &lt;div style=&quot;background:#e9eef5; padding:10px 14px; font-size:13px; color:#4d6078; border-bottom:1px solid #d9dee7;&quot;&gt;Listing 2. Reviewer-friendly method description for LBLOBXFL&lt;/div&gt;
      &lt;pre style=&quot;margin:0; padding:18px 18px; overflow:auto; font-size:14px; line-height:1.6; font-family:&#39;Courier New&#39;, monospace; color:#243447;&quot;&gt;&lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;ItemDef&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;OID=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;IT.LB.LBLOBXFL&quot;&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;Name=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;LBLOBXFL&quot;&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;DataType=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;text&quot;&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;Length=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;1&quot;&lt;/span&gt;&lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;gt;&lt;/span&gt;
  &lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;Description&amp;gt;&lt;/span&gt;
    &lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;TranslatedText&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;xml:lang=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;en&quot;&lt;/span&gt;&lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;gt;&lt;/span&gt;Last Observation Before Exposure Flag&lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;/TranslatedText&amp;gt;&lt;/span&gt;
  &lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;/Description&amp;gt;&lt;/span&gt;
  &lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;Origin&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;Type=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;Derived&quot;&lt;/span&gt;&lt;span style=&quot;color:#1a4f8f;&quot;&gt;/&amp;gt;&lt;/span&gt;
  &lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;MethodRef&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;MethodOID=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;MT.LB.LBLOBXFL&quot;&lt;/span&gt;&lt;span style=&quot;color:#1a4f8f;&quot;&gt;/&amp;gt;&lt;/span&gt;
&lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;/ItemDef&amp;gt;&lt;/span&gt;

&lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;MethodDef&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;OID=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;MT.LB.LBLOBXFL&quot;&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;Name=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;Last Observation Before Exposure Flag Derivation&quot;&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;Type=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;Computation&quot;&lt;/span&gt;&lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;gt;&lt;/span&gt;
  &lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;Description&amp;gt;&lt;/span&gt;
    &lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;TranslatedText&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;xml:lang=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;en&quot;&lt;/span&gt;&lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;gt;&lt;/span&gt;
      LBLOBXFL is assigned as &#39;Y&#39; to the chronologically latest non-missing
      result collected before first exposure. If only dates are available,
      collection date must be strictly earlier than DM.RFSTDTC. Records on
      the first-dose date are eligible only when both collection time and
      dosing time are available and the collection occurs before dosing.
      Records with missing time on the first-dose date are not eligible.
      If multiple qualifying records exist, the latest chronological record
      is selected.
    &lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;/TranslatedText&amp;gt;&lt;/span&gt;
  &lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;/Description&amp;gt;&lt;/span&gt;
&lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;/MethodDef&amp;gt;&lt;/span&gt;&lt;/pre&gt;
    &lt;/div&gt;

    &lt;p&gt;The second version gives the reviewer something operational. It defines the anchor, the same-day rule, the missing-time rule, and the tie-break rule.&lt;/p&gt;

    &lt;h2 style=&quot;font-size:28px; color:#153a63; margin:34px 0 12px 0;&quot;&gt;3. “Partial dates were imputed”&lt;/h2&gt;

    &lt;p&gt;This is a classic phrase that carries almost no review value by itself.&lt;/p&gt;

    &lt;p&gt;It does not answer:&lt;/p&gt;

    &lt;ul style=&quot;padding-left:24px;&quot;&gt;
      &lt;li&gt;Which date patterns were imputed&lt;/li&gt;
      &lt;li&gt;What values were assigned&lt;/li&gt;
      &lt;li&gt;Where the imputation was used&lt;/li&gt;
      &lt;li&gt;Whether SDTM values were changed or left as collected&lt;/li&gt;
      &lt;li&gt;Whether the rule applies across domains or only in one place&lt;/li&gt;
    &lt;/ul&gt;

    &lt;div style=&quot;background:#f8fafc; border:1px solid #d7dee8; border-radius:12px; padding:16px 18px; margin:18px 0;&quot;&gt;
      &lt;strong&gt;Weak&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
      Partial dates were imputed.
    &lt;/div&gt;

    &lt;div style=&quot;background:#eef8ef; border:1px solid #bbd9bf; border-radius:12px; padding:16px 18px; margin:18px 0;&quot;&gt;
      &lt;strong&gt;Better&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
      Partial AE start dates are imputed for treatment-emergent classification only. Dates in YYYY-MM format are imputed to the first day of the month. Dates in YYYY format are imputed to January 1. Original collected values remain in AESTDTC. Imputed dates are not stored in SDTM and are not used for survival or time-to-event analyses.
    &lt;/div&gt;

    &lt;p&gt;The key point is not just the method. It is the scope.&lt;/p&gt;

    &lt;h2 style=&quot;font-size:28px; color:#153a63; margin:34px 0 12px 0;&quot;&gt;4. “Standard unit”&lt;/h2&gt;

    &lt;p&gt;This shows up in lab metadata all the time, especially when teams rely on short value-level notes.&lt;/p&gt;

    &lt;p&gt;The phrase does not tell the reviewer:&lt;/p&gt;

    &lt;ul style=&quot;padding-left:24px;&quot;&gt;
      &lt;li&gt;Whether results were converted&lt;/li&gt;
      &lt;li&gt;How vendor-specific factors were handled&lt;/li&gt;
      &lt;li&gt;Whether standardization happened before or after flag derivation&lt;/li&gt;
      &lt;li&gt;What happened to character results, such as values reported as below the quantification limit&lt;/li&gt;
    &lt;/ul&gt;

    &lt;div style=&quot;background:#f8fafc; border:1px solid #d7dee8; border-radius:12px; padding:16px 18px; margin:18px 0;&quot;&gt;
      &lt;strong&gt;Weak&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
      Standard unit.
    &lt;/div&gt;

    &lt;div style=&quot;background:#eef8ef; border:1px solid #bbd9bf; border-radius:12px; padding:16px 18px; margin:18px 0;&quot;&gt;
      &lt;strong&gt;Better&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
      Results for LBTESTCD = ALT are standardized to U/L in LBSTRESU. When LBORRESU differs from U/L, conversion uses approved central lab conversion factors before derivation of LBNRIND. Character results reported as below quantification limit remain in LBSTRESC and do not populate LBSTRESN.
    &lt;/div&gt;

    &lt;p&gt;This kind of wording is much more useful to experienced programmers because it states order of operations and data-type behavior.&lt;/p&gt;

    &lt;h2 style=&quot;font-size:28px; color:#153a63; margin:34px 0 12px 0;&quot;&gt;5. “Relationship to study drug”&lt;/h2&gt;

    &lt;p&gt;This looks harmless, but it often hides one of the hardest traceability questions in SDTM: was the value collected, assigned, or sponsor-derived?&lt;/p&gt;

    &lt;p&gt;That question gets sharper in studies with more than one treatment or more than one possible relationship target.&lt;/p&gt;

    &lt;div style=&quot;background:#f8fafc; border:1px solid #d7dee8; border-radius:12px; padding:16px 18px; margin:18px 0;&quot;&gt;
      &lt;strong&gt;Weak&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
      Relationship to study drug.
    &lt;/div&gt;

    &lt;div style=&quot;background:#eef8ef; border:1px solid #bbd9bf; border-radius:12px; padding:16px 18px; margin:18px 0;&quot;&gt;
      &lt;strong&gt;Better&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
      Collected on the AE CRF as investigator assessment of relationship to study treatment. In studies with multiple investigational products, the SDTM value represents relationship to the primary investigational product defined in the protocol. When multiple products are recorded, sponsor mapping follows the hierarchy specified in the study data handling conventions.
    &lt;/div&gt;

    &lt;p&gt;The stronger version makes the origin and sponsor logic visible. That is what makes the metadata useful.&lt;/p&gt;

    &lt;!-- Visual 2 --&gt;
    &lt;div style=&quot;margin:34px 0; border:1px solid #d7e1ee; border-radius:16px; overflow:hidden; background:#fbfdff;&quot;&gt;
      &lt;div style=&quot;padding:16px 20px 8px 20px; font-size:22px; font-weight:bold; color:#153a63;&quot;&gt;Figure 2. The hidden pattern behind all five phrases&lt;/div&gt;
      &lt;div style=&quot;padding:0 20px 20px 20px; color:#5b6878; font-size:14px;&quot;&gt;The wording changes from short labels to actual rules.&lt;/div&gt;
      &lt;div style=&quot;padding:8px 14px 24px 14px;&quot;&gt;
        &lt;svg viewBox=&quot;0 0 860 320&quot; width=&quot;100%&quot; xmlns=&quot;http://www.w3.org/2000/svg&quot; role=&quot;img&quot; aria-label=&quot;From short labels to operational rules&quot;&gt;
          &lt;rect x=&quot;30&quot; y=&quot;36&quot; width=&quot;340&quot; height=&quot;248&quot; rx=&quot;18&quot; fill=&quot;#fff1f1&quot; stroke=&quot;#d69797&quot; stroke-width=&quot;2&quot;/&gt;
          &lt;text x=&quot;200&quot; y=&quot;68&quot; text-anchor=&quot;middle&quot; font-size=&quot;24&quot; font-family=&quot;Arial&quot; fill=&quot;#8f2e2e&quot; font-weight=&quot;700&quot;&gt;Short label style&lt;/text&gt;
          &lt;text x=&quot;58&quot; y=&quot;112&quot; font-size=&quot;18&quot; font-family=&quot;Arial&quot; fill=&quot;#693434&quot;&gt;• Derived from reference start date&lt;/text&gt;
          &lt;text x=&quot;58&quot; y=&quot;148&quot; font-size=&quot;18&quot; font-family=&quot;Arial&quot; fill=&quot;#693434&quot;&gt;• Last non-missing prior to treatment&lt;/text&gt;
          &lt;text x=&quot;58&quot; y=&quot;184&quot; font-size=&quot;18&quot; font-family=&quot;Arial&quot; fill=&quot;#693434&quot;&gt;• Partial dates were imputed&lt;/text&gt;
          &lt;text x=&quot;58&quot; y=&quot;220&quot; font-size=&quot;18&quot; font-family=&quot;Arial&quot; fill=&quot;#693434&quot;&gt;• Standard unit&lt;/text&gt;
          &lt;text x=&quot;58&quot; y=&quot;256&quot; font-size=&quot;18&quot; font-family=&quot;Arial&quot; fill=&quot;#693434&quot;&gt;• Relationship to study drug&lt;/text&gt;

          &lt;rect x=&quot;490&quot; y=&quot;36&quot; width=&quot;340&quot; height=&quot;248&quot; rx=&quot;18&quot; fill=&quot;#eef8ef&quot; stroke=&quot;#86b388&quot; stroke-width=&quot;2&quot;/&gt;
          &lt;text x=&quot;660&quot; y=&quot;68&quot; text-anchor=&quot;middle&quot; font-size=&quot;24&quot; font-family=&quot;Arial&quot; fill=&quot;#27673a&quot; font-weight=&quot;700&quot;&gt;Operational rule style&lt;/text&gt;
          &lt;text x=&quot;518&quot; y=&quot;112&quot; font-size=&quot;18&quot; font-family=&quot;Arial&quot; fill=&quot;#2d4f32&quot;&gt;• Formula stated&lt;/text&gt;
          &lt;text x=&quot;518&quot; y=&quot;148&quot; font-size=&quot;18&quot; font-family=&quot;Arial&quot; fill=&quot;#2d4f32&quot;&gt;• Anchor stated&lt;/text&gt;
          &lt;text x=&quot;518&quot; y=&quot;184&quot; font-size=&quot;18&quot; font-family=&quot;Arial&quot; fill=&quot;#2d4f32&quot;&gt;• Scope stated&lt;/text&gt;
          &lt;text x=&quot;518&quot; y=&quot;220&quot; font-size=&quot;18&quot; font-family=&quot;Arial&quot; fill=&quot;#2d4f32&quot;&gt;• Order of operations stated&lt;/text&gt;
          &lt;text x=&quot;518&quot; y=&quot;256&quot; font-size=&quot;18&quot; font-family=&quot;Arial&quot; fill=&quot;#2d4f32&quot;&gt;• Origin and exceptions stated&lt;/text&gt;

          &lt;path d=&quot;M370 160 C430 160, 440 160, 490 160&quot; fill=&quot;none&quot; stroke=&quot;#5e7fa4&quot; stroke-width=&quot;4&quot;/&gt;
          &lt;polygon points=&quot;490,160 478,153 478,167&quot; fill=&quot;#5e7fa4&quot;/&gt;
        &lt;/svg&gt;
      &lt;/div&gt;
    &lt;/div&gt;

    &lt;h2 style=&quot;font-size:28px; color:#153a63; margin:34px 0 12px 0;&quot;&gt;What these phrases have in common&lt;/h2&gt;

    &lt;p&gt;None of these phrases are always wrong.&lt;/p&gt;

    &lt;p&gt;The problem is that they are too short for the job they are trying to do.&lt;/p&gt;

    &lt;p&gt;They work as internal reminders for a team that already knows the logic. They do not work well as reviewer-facing metadata.&lt;/p&gt;

    &lt;p&gt;That is the shift worth making in define.xml work. Stop writing labels that point to logic. Start writing metadata that explains the logic.&lt;/p&gt;

    &lt;h2 style=&quot;font-size:28px; color:#153a63; margin:34px 0 12px 0;&quot;&gt;A practical test before sign-off&lt;/h2&gt;

    &lt;p&gt;Before a define.xml package goes out, ask this:&lt;/p&gt;

    &lt;div style=&quot;background:#153a63; color:#ffffff; border-radius:14px; padding:20px 22px; margin:18px 0;&quot;&gt;
      Can another programmer, or a reviewer, understand this rule without asking what we meant?
    &lt;/div&gt;

    &lt;p&gt;If the answer is no, the description is probably too thin.&lt;/p&gt;

    &lt;h2 style=&quot;font-size:28px; color:#153a63; margin:34px 0 12px 0;&quot;&gt;Final thought&lt;/h2&gt;

    &lt;p&gt;Most review friction in define.xml does not come from dramatic mistakes.&lt;/p&gt;

    &lt;p&gt;It comes from ordinary wording that feels good enough until someone outside the team tries to rely on it.&lt;/p&gt;

    &lt;p&gt;That is why these five phrases matter. They look small. But they are often the exact place where trust in the metadata starts to weaken.&lt;/p&gt;

    &lt;div style=&quot;margin-top:36px; padding:18px 20px; border-radius:14px; background:#153a63; color:#ffffff;&quot;&gt;
      &lt;strong&gt;Suggested closing question for comments&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
      Which define.xml phrase do you see most often in submissions that sounds acceptable internally, but creates questions in review?
    &lt;/div&gt;

  &lt;/div&gt;
&lt;/body&gt;
&lt;/html&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2315822260943695633/posts/default/7707934396719644475'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2315822260943695633/posts/default/7707934396719644475'/><link rel='alternate' type='text/html' href='http://studysas.blogspot.com/2026/03/five-define.html' title='Five Define.xml Phrases That Sound Fine, But Trigger Review Questions'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-2315822260943695633.post-2220893745947486488</id><published>2026-03-30T11:54:00.003-04:00</published><updated>2026-03-30T12:12:39.958-04:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="Define"/><category scheme="http://www.blogger.com/atom/ns#" term="define xml"/><category scheme="http://www.blogger.com/atom/ns#" term="Pinnacle 21"/><category scheme="http://www.blogger.com/atom/ns#" term="SDTM Validation"/><category scheme="http://www.blogger.com/atom/ns#" term="xml review and findings"/><title type='text'>Your SDTM Passed Validation. That Doesn’t Mean You’re Safe</title><content type='html'>&lt;!DOCTYPE html&gt;
&lt;html lang=&quot;en&quot;&gt;
&lt;head&gt;
  &lt;meta charset=&quot;UTF-8&quot;&gt;
  &lt;meta name=&quot;viewport&quot; content=&quot;width=device-width, initial-scale=1.0&quot;&gt;
  &lt;title&gt;Your SDTM Passed Validation. That Doesn’t Mean You’re Safe.&lt;/title&gt;
&lt;/head&gt;
&lt;body style=&quot;margin:0; padding:0; background:#f3f6fa; font-family:Arial, Helvetica, sans-serif; color:#1f2937;&quot;&gt;

  &lt;div style=&quot;max-width:900px; margin:0 auto; background:#ffffff; padding:36px 28px 64px 28px; line-height:1.75; font-size:16px;&quot;&gt;

    &lt;!-- Hero --&gt;
    &lt;div style=&quot;background:linear-gradient(135deg,#153a63,#2b68b0); color:#ffffff; padding:34px 30px; border-radius:16px; margin-bottom:30px;&quot;&gt;
      &lt;div style=&quot;font-size:12px; letter-spacing:1px; text-transform:uppercase; opacity:0.9; margin-bottom:8px;&quot;&gt;StudySAS Blog&lt;/div&gt;
      &lt;h1 style=&quot;margin:0 0 10px 0; font-size:34px; line-height:1.2;&quot;&gt;Your SDTM Passed Validation. That Doesn’t Mean You’re Safe.&lt;/h1&gt;
      &lt;p style=&quot;margin:0; font-size:18px; line-height:1.6; max-width:760px;&quot;&gt;
        Why clean Pinnacle 21 results do not always mean your SDTM package is ready for review, and why define.xml still decides how quickly a reviewer can understand and trust your data.
      &lt;/p&gt;
    &lt;/div&gt;

    &lt;p&gt;Most teams celebrate when Pinnacle 21 is clean.&lt;/p&gt;

    &lt;p&gt;That makes sense. It feels like the hard part is over.&lt;/p&gt;

    &lt;p&gt;But regulators do not review submissions that way.&lt;/p&gt;

    &lt;p&gt;They start with &lt;strong&gt;define.xml&lt;/strong&gt;.&lt;/p&gt;

    &lt;p&gt;Across repeated submission work, one pattern becomes obvious.&lt;/p&gt;

    &lt;p&gt;&lt;strong&gt;Clean datasets get you submitted.&lt;/strong&gt;&lt;br&gt;
    &lt;strong&gt;Clear metadata gets you through review.&lt;/strong&gt;&lt;/p&gt;

    &lt;!-- Visual 1 --&gt;
    &lt;div style=&quot;margin:34px 0; border:1px solid #d7e1ee; border-radius:16px; overflow:hidden; background:#fbfdff;&quot;&gt;
      &lt;div style=&quot;padding:16px 20px 8px 20px; font-size:22px; font-weight:bold; color:#153a63;&quot;&gt;Figure 1. What teams think vs what reviewers actually do&lt;/div&gt;
      &lt;div style=&quot;padding:0 20px 20px 20px; color:#5b6878; font-size:14px;&quot;&gt;A simple process view of the gap between validation completion and actual reviewer workflow.&lt;/div&gt;
      &lt;div style=&quot;padding:8px 14px 24px 14px;&quot;&gt;
        &lt;svg viewBox=&quot;0 0 860 300&quot; width=&quot;100%&quot; xmlns=&quot;http://www.w3.org/2000/svg&quot; role=&quot;img&quot; aria-label=&quot;Comparison between team view and reviewer workflow&quot;&gt;
          &lt;defs&gt;
            &lt;linearGradient id=&quot;bluebox&quot; x1=&quot;0%&quot; y1=&quot;0%&quot; x2=&quot;100%&quot; y2=&quot;100%&quot;&gt;
              &lt;stop offset=&quot;0%&quot; stop-color=&quot;#edf4fb&quot;/&gt;
              &lt;stop offset=&quot;100%&quot; stop-color=&quot;#d9e8f8&quot;/&gt;
            &lt;/linearGradient&gt;
            &lt;linearGradient id=&quot;greenbox&quot; x1=&quot;0%&quot; y1=&quot;0%&quot; x2=&quot;100%&quot; y2=&quot;100%&quot;&gt;
              &lt;stop offset=&quot;0%&quot; stop-color=&quot;#eef7eb&quot;/&gt;
              &lt;stop offset=&quot;100%&quot; stop-color=&quot;#dcecd2&quot;/&gt;
            &lt;/linearGradient&gt;
          &lt;/defs&gt;

          &lt;rect x=&quot;20&quot; y=&quot;28&quot; width=&quot;330&quot; height=&quot;220&quot; rx=&quot;18&quot; fill=&quot;url(#bluebox)&quot; stroke=&quot;#90b3da&quot; stroke-width=&quot;2&quot;/&gt;
          &lt;text x=&quot;185&quot; y=&quot;58&quot; text-anchor=&quot;middle&quot; font-size=&quot;24&quot; font-family=&quot;Arial&quot; fill=&quot;#153a63&quot; font-weight=&quot;700&quot;&gt;Common team view&lt;/text&gt;

          &lt;rect x=&quot;72&quot; y=&quot;86&quot; width=&quot;226&quot; height=&quot;46&quot; rx=&quot;12&quot; fill=&quot;#2b68b0&quot;/&gt;
          &lt;text x=&quot;185&quot; y=&quot;116&quot; text-anchor=&quot;middle&quot; font-size=&quot;19&quot; font-family=&quot;Arial&quot; fill=&quot;#ffffff&quot; font-weight=&quot;700&quot;&gt;P21 validation clean&lt;/text&gt;

          &lt;path d=&quot;M185 132 L185 164&quot; stroke=&quot;#6b8fb6&quot; stroke-width=&quot;4&quot;/&gt;
          &lt;polygon points=&quot;185,174 178,160 192,160&quot; fill=&quot;#6b8fb6&quot;/&gt;

          &lt;rect x=&quot;72&quot; y=&quot;180&quot; width=&quot;226&quot; height=&quot;42&quot; rx=&quot;12&quot; fill=&quot;#5f87bd&quot;/&gt;
          &lt;text x=&quot;185&quot; y=&quot;207&quot; text-anchor=&quot;middle&quot; font-size=&quot;18&quot; font-family=&quot;Arial&quot; fill=&quot;#ffffff&quot; font-weight=&quot;700&quot;&gt;Submission ready&lt;/text&gt;

          &lt;rect x=&quot;510&quot; y=&quot;28&quot; width=&quot;330&quot; height=&quot;220&quot; rx=&quot;18&quot; fill=&quot;url(#greenbox)&quot; stroke=&quot;#8db07a&quot; stroke-width=&quot;2&quot;/&gt;
          &lt;text x=&quot;675&quot; y=&quot;58&quot; text-anchor=&quot;middle&quot; font-size=&quot;24&quot; font-family=&quot;Arial&quot; fill=&quot;#2d5a27&quot; font-weight=&quot;700&quot;&gt;Actual reviewer flow&lt;/text&gt;

          &lt;rect x=&quot;565&quot; y=&quot;80&quot; width=&quot;220&quot; height=&quot;34&quot; rx=&quot;10&quot; fill=&quot;#62913a&quot;/&gt;
          &lt;text x=&quot;675&quot; y=&quot;102&quot; text-anchor=&quot;middle&quot; font-size=&quot;16&quot; font-family=&quot;Arial&quot; fill=&quot;#ffffff&quot; font-weight=&quot;700&quot;&gt;Open define.xml&lt;/text&gt;

          &lt;rect x=&quot;565&quot; y=&quot;124&quot; width=&quot;220&quot; height=&quot;34&quot; rx=&quot;10&quot; fill=&quot;#729f49&quot;/&gt;
          &lt;text x=&quot;675&quot; y=&quot;146&quot; text-anchor=&quot;middle&quot; font-size=&quot;16&quot; font-family=&quot;Arial&quot; fill=&quot;#ffffff&quot; font-weight=&quot;700&quot;&gt;Read derivation and origin&lt;/text&gt;

          &lt;rect x=&quot;565&quot; y=&quot;168&quot; width=&quot;220&quot; height=&quot;34&quot; rx=&quot;10&quot; fill=&quot;#84ab5d&quot;/&gt;
          &lt;text x=&quot;675&quot; y=&quot;190&quot; text-anchor=&quot;middle&quot; font-size=&quot;16&quot; font-family=&quot;Arial&quot; fill=&quot;#ffffff&quot; font-weight=&quot;700&quot;&gt;Check value-level metadata&lt;/text&gt;

          &lt;rect x=&quot;565&quot; y=&quot;212&quot; width=&quot;220&quot; height=&quot;34&quot; rx=&quot;10&quot; fill=&quot;#95b971&quot;/&gt;
          &lt;text x=&quot;675&quot; y=&quot;234&quot; text-anchor=&quot;middle&quot; font-size=&quot;16&quot; font-family=&quot;Arial&quot; fill=&quot;#ffffff&quot; font-weight=&quot;700&quot;&gt;Raise questions if unclear&lt;/text&gt;

          &lt;path d=&quot;M350 138 C420 138, 445 138, 510 138&quot; fill=&quot;none&quot; stroke=&quot;#cf8a1c&quot; stroke-width=&quot;4&quot;/&gt;
          &lt;polygon points=&quot;510,138 498,131 498,145&quot; fill=&quot;#cf8a1c&quot;/&gt;
          &lt;text x=&quot;430&quot; y=&quot;120&quot; text-anchor=&quot;middle&quot; font-size=&quot;18&quot; font-family=&quot;Arial&quot; fill=&quot;#a4680e&quot; font-weight=&quot;700&quot;&gt;Interpretation gap&lt;/text&gt;
        &lt;/svg&gt;
      &lt;/div&gt;
    &lt;/div&gt;

    &lt;h2 style=&quot;font-size:28px; color:#153a63; margin:34px 0 12px 0;&quot;&gt;What reviewers actually do first&lt;/h2&gt;

    &lt;p&gt;Before they ever look at code, reviewers usually follow a simple path:&lt;/p&gt;

    &lt;ol style=&quot;padding-left:24px;&quot;&gt;
      &lt;li&gt;Open define.xml&lt;/li&gt;
      &lt;li&gt;Search for a variable or derivation rule&lt;/li&gt;
      &lt;li&gt;Read origin, comments, method, and value-level metadata&lt;/li&gt;
      &lt;li&gt;Decide whether the logic is clear enough to trust&lt;/li&gt;
      &lt;li&gt;Go to SDRG, ADRG, or programs only if something is still unclear&lt;/li&gt;
    &lt;/ol&gt;

    &lt;p&gt;If define.xml is vague, questions start early. Not because the programming is wrong, but because the reviewer cannot safely infer what you meant.&lt;/p&gt;

    &lt;div style=&quot;background:#fff7e9; border-left:6px solid #d08d21; padding:18px 18px 18px 20px; margin:28px 0; border-radius:10px;&quot;&gt;
      &lt;strong&gt;Practical point&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
      A clean validation report tells you the package is technically acceptable. It does not tell you the metadata is reviewer-friendly.
    &lt;/div&gt;

    &lt;h2 style=&quot;font-size:28px; color:#153a63; margin:34px 0 12px 0;&quot;&gt;A real example from SDTM LB&lt;/h2&gt;

    &lt;p&gt;Here is the kind of define.xml statement many teams use for an SDTM Findings flag:&lt;/p&gt;

    &lt;div style=&quot;background:#f5f7fa; border:1px solid #d7dee8; border-radius:12px; padding:14px 16px; margin:18px 0;&quot;&gt;
      &lt;code style=&quot;font-family:&#39;Courier New&#39;, monospace; font-size:15px; color:#25364a;&quot;&gt;Last observation before exposure flag is assigned to the last non-missing result prior to treatment.&lt;/code&gt;
    &lt;/div&gt;

    &lt;p&gt;On paper, that looks fine.&lt;/p&gt;

    &lt;p&gt;In review, it often is not enough.&lt;/p&gt;

    &lt;p&gt;A reviewer can reasonably ask:&lt;/p&gt;

    &lt;ul style=&quot;padding-left:24px;&quot;&gt;
      &lt;li&gt;What defines “prior to treatment”: RFSTDTC or the actual exposure datetime?&lt;/li&gt;
      &lt;li&gt;What happens for records collected on the same day as first dose?&lt;/li&gt;
      &lt;li&gt;What if collection time is missing?&lt;/li&gt;
      &lt;li&gt;Are unscheduled visits included?&lt;/li&gt;
      &lt;li&gt;If multiple qualifying values exist, how is “last” decided?&lt;/li&gt;
      &lt;li&gt;Is the same rule used across LB, VS, EG, and QS?&lt;/li&gt;
    &lt;/ul&gt;

    &lt;p&gt;The data may be perfectly correct. The issue is that the metadata leaves room for more than one interpretation.&lt;/p&gt;

    &lt;!-- Visual 2 --&gt;
    &lt;div style=&quot;margin:34px 0; border:1px solid #d7e1ee; border-radius:16px; overflow:hidden; background:#fbfdff;&quot;&gt;
      &lt;div style=&quot;padding:16px 20px 8px 20px; font-size:22px; font-weight:bold; color:#153a63;&quot;&gt;Figure 2. Weak metadata vs strong metadata&lt;/div&gt;
      &lt;div style=&quot;padding:0 20px 20px 20px; color:#5b6878; font-size:14px;&quot;&gt;The difference is not style. It is whether the reviewer has to guess.&lt;/div&gt;
      &lt;div style=&quot;padding:8px 14px 24px 14px;&quot;&gt;
        &lt;svg viewBox=&quot;0 0 860 360&quot; width=&quot;100%&quot; xmlns=&quot;http://www.w3.org/2000/svg&quot; role=&quot;img&quot; aria-label=&quot;Comparison of weak and strong metadata&quot;&gt;
          &lt;rect x=&quot;28&quot; y=&quot;32&quot; width=&quot;356&quot; height=&quot;286&quot; rx=&quot;18&quot; fill=&quot;#fff1f1&quot; stroke=&quot;#d69797&quot; stroke-width=&quot;2&quot;/&gt;
          &lt;text x=&quot;206&quot; y=&quot;66&quot; text-anchor=&quot;middle&quot; font-size=&quot;24&quot; font-family=&quot;Arial&quot; fill=&quot;#8f2e2e&quot; font-weight=&quot;700&quot;&gt;Weak metadata&lt;/text&gt;
          &lt;rect x=&quot;66&quot; y=&quot;92&quot; width=&quot;280&quot; height=&quot;52&quot; rx=&quot;12&quot; fill=&quot;#c74a4a&quot;/&gt;
          &lt;text x=&quot;206&quot; y=&quot;124&quot; text-anchor=&quot;middle&quot; font-size=&quot;18&quot; font-family=&quot;Arial&quot; fill=&quot;#ffffff&quot; font-weight=&quot;700&quot;&gt;“Last non-missing prior to treatment”&lt;/text&gt;
          &lt;text x=&quot;72&quot; y=&quot;182&quot; font-size=&quot;18&quot; font-family=&quot;Arial&quot; fill=&quot;#633030&quot;&gt;• No anchor&lt;/text&gt;
          &lt;text x=&quot;72&quot; y=&quot;218&quot; font-size=&quot;18&quot; font-family=&quot;Arial&quot; fill=&quot;#633030&quot;&gt;• No same-day rule&lt;/text&gt;
          &lt;text x=&quot;72&quot; y=&quot;254&quot; font-size=&quot;18&quot; font-family=&quot;Arial&quot; fill=&quot;#633030&quot;&gt;• No missing-time rule&lt;/text&gt;
          &lt;text x=&quot;72&quot; y=&quot;290&quot; font-size=&quot;18&quot; font-family=&quot;Arial&quot; fill=&quot;#633030&quot;&gt;• No tie-break logic&lt;/text&gt;

          &lt;rect x=&quot;476&quot; y=&quot;32&quot; width=&quot;356&quot; height=&quot;286&quot; rx=&quot;18&quot; fill=&quot;#eef8ef&quot; stroke=&quot;#86b388&quot; stroke-width=&quot;2&quot;/&gt;
          &lt;text x=&quot;654&quot; y=&quot;66&quot; text-anchor=&quot;middle&quot; font-size=&quot;24&quot; font-family=&quot;Arial&quot; fill=&quot;#27673a&quot; font-weight=&quot;700&quot;&gt;Strong metadata&lt;/text&gt;
          &lt;rect x=&quot;514&quot; y=&quot;92&quot; width=&quot;280&quot; height=&quot;52&quot; rx=&quot;12&quot; fill=&quot;#3f8a4d&quot;/&gt;
          &lt;text x=&quot;654&quot; y=&quot;124&quot; text-anchor=&quot;middle&quot; font-size=&quot;18&quot; font-family=&quot;Arial&quot; fill=&quot;#ffffff&quot; font-weight=&quot;700&quot;&gt;Operational rule is explicit&lt;/text&gt;
          &lt;text x=&quot;520&quot; y=&quot;182&quot; font-size=&quot;18&quot; font-family=&quot;Arial&quot; fill=&quot;#2d4f32&quot;&gt;• Defines anchor&lt;/text&gt;
          &lt;text x=&quot;520&quot; y=&quot;218&quot; font-size=&quot;18&quot; font-family=&quot;Arial&quot; fill=&quot;#2d4f32&quot;&gt;• States same-day handling&lt;/text&gt;
          &lt;text x=&quot;520&quot; y=&quot;254&quot; font-size=&quot;18&quot; font-family=&quot;Arial&quot; fill=&quot;#2d4f32&quot;&gt;• Handles missing time&lt;/text&gt;
          &lt;text x=&quot;520&quot; y=&quot;290&quot; font-size=&quot;18&quot; font-family=&quot;Arial&quot; fill=&quot;#2d4f32&quot;&gt;• Resolves ties consistently&lt;/text&gt;

          &lt;path d=&quot;M384 174 C426 174, 438 174, 476 174&quot; fill=&quot;none&quot; stroke=&quot;#5e7fa4&quot; stroke-width=&quot;4&quot;/&gt;
          &lt;polygon points=&quot;476,174 464,167 464,181&quot; fill=&quot;#5e7fa4&quot;/&gt;
        &lt;/svg&gt;
      &lt;/div&gt;
    &lt;/div&gt;

    &lt;h2 style=&quot;font-size:28px; color:#153a63; margin:34px 0 12px 0;&quot;&gt;What strong metadata looks like in SDTM&lt;/h2&gt;

    &lt;p&gt;A better define.xml statement does not just sound more formal. It removes doubt.&lt;/p&gt;

    &lt;div style=&quot;background:#eef8ef; border:1px solid #bbd9bf; border-radius:12px; padding:18px 18px; margin:18px 0;&quot;&gt;
      &lt;strong&gt;Example of stronger wording&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
      LBLOBXFL is assigned as &#39;Y&#39; to the chronologically latest non-missing result collected before first exposure. If only dates are available, collection date must be strictly earlier than DM.RFSTDTC. Records on the first-dose date are eligible only when both collection time and dosing time are available and the collection occurs before dosing. Records with missing time on the first-dose date are not eligible. If multiple qualifying records exist, the latest chronological record is selected.
    &lt;/div&gt;

    &lt;p&gt;Now the reviewer knows the anchor, the same-day rule, the missing-time rule, and the tie-break rule.&lt;/p&gt;

    &lt;h2 style=&quot;font-size:28px; color:#153a63; margin:34px 0 12px 0;&quot;&gt;CDISC-style metadata flow&lt;/h2&gt;

    &lt;p&gt;At a practical level, define.xml sits in the middle of a traceability chain. The reviewer should be able to move through that chain without guessing.&lt;/p&gt;

    &lt;!-- Visual 3 --&gt;
    &lt;div style=&quot;margin:34px 0; border:1px solid #d7e1ee; border-radius:16px; overflow:hidden; background:#fbfdff;&quot;&gt;
      &lt;div style=&quot;padding:16px 20px 8px 20px; font-size:22px; font-weight:bold; color:#153a63;&quot;&gt;Figure 3. Traceability path from collection to reviewer interpretation&lt;/div&gt;
      &lt;div style=&quot;padding:0 20px 20px 20px; color:#5b6878; font-size:14px;&quot;&gt;A CDISC-style flow showing how collected data becomes reviewer-facing metadata.&lt;/div&gt;
      &lt;div style=&quot;padding:8px 14px 24px 14px;&quot;&gt;
        &lt;svg viewBox=&quot;0 0 860 250&quot; width=&quot;100%&quot; xmlns=&quot;http://www.w3.org/2000/svg&quot; role=&quot;img&quot; aria-label=&quot;Traceability flow diagram&quot;&gt;
          &lt;rect x=&quot;18&quot; y=&quot;86&quot; width=&quot;140&quot; height=&quot;72&quot; rx=&quot;14&quot; fill=&quot;#eef4fb&quot; stroke=&quot;#90b3da&quot; stroke-width=&quot;2&quot;/&gt;
          &lt;text x=&quot;88&quot; y=&quot;116&quot; text-anchor=&quot;middle&quot; font-size=&quot;19&quot; font-family=&quot;Arial&quot; fill=&quot;#153a63&quot; font-weight=&quot;700&quot;&gt;CRF&lt;/text&gt;
          &lt;text x=&quot;88&quot; y=&quot;138&quot; text-anchor=&quot;middle&quot; font-size=&quot;14&quot; font-family=&quot;Arial&quot; fill=&quot;#3e5876&quot;&gt;Collected source&lt;/text&gt;

          &lt;rect x=&quot;186&quot; y=&quot;86&quot; width=&quot;140&quot; height=&quot;72&quot; rx=&quot;14&quot; fill=&quot;#eef4fb&quot; stroke=&quot;#90b3da&quot; stroke-width=&quot;2&quot;/&gt;
          &lt;text x=&quot;256&quot; y=&quot;116&quot; text-anchor=&quot;middle&quot; font-size=&quot;19&quot; font-family=&quot;Arial&quot; fill=&quot;#153a63&quot; font-weight=&quot;700&quot;&gt;SDTM&lt;/text&gt;
          &lt;text x=&quot;256&quot; y=&quot;138&quot; text-anchor=&quot;middle&quot; font-size=&quot;14&quot; font-family=&quot;Arial&quot; fill=&quot;#3e5876&quot;&gt;Standardized data&lt;/text&gt;

          &lt;rect x=&quot;354&quot; y=&quot;72&quot; width=&quot;152&quot; height=&quot;100&quot; rx=&quot;14&quot; fill=&quot;#f4f8ee&quot; stroke=&quot;#8db07a&quot; stroke-width=&quot;2&quot;/&gt;
          &lt;text x=&quot;430&quot; y=&quot;102&quot; text-anchor=&quot;middle&quot; font-size=&quot;18&quot; font-family=&quot;Arial&quot; fill=&quot;#2d5a27&quot; font-weight=&quot;700&quot;&gt;define.xml&lt;/text&gt;
          &lt;text x=&quot;430&quot; y=&quot;126&quot; text-anchor=&quot;middle&quot; font-size=&quot;14&quot; font-family=&quot;Arial&quot; fill=&quot;#476734&quot;&gt;Origin&lt;/text&gt;
          &lt;text x=&quot;430&quot; y=&quot;146&quot; text-anchor=&quot;middle&quot; font-size=&quot;14&quot; font-family=&quot;Arial&quot; fill=&quot;#476734&quot;&gt;Method&lt;/text&gt;
          &lt;text x=&quot;430&quot; y=&quot;166&quot; text-anchor=&quot;middle&quot; font-size=&quot;14&quot; font-family=&quot;Arial&quot; fill=&quot;#476734&quot;&gt;Value-level rules&lt;/text&gt;

          &lt;rect x=&quot;534&quot; y=&quot;86&quot; width=&quot;140&quot; height=&quot;72&quot; rx=&quot;14&quot; fill=&quot;#eef4fb&quot; stroke=&quot;#90b3da&quot; stroke-width=&quot;2&quot;/&gt;
          &lt;text x=&quot;604&quot; y=&quot;116&quot; text-anchor=&quot;middle&quot; font-size=&quot;19&quot; font-family=&quot;Arial&quot; fill=&quot;#153a63&quot; font-weight=&quot;700&quot;&gt;SDRG&lt;/text&gt;
          &lt;text x=&quot;604&quot; y=&quot;138&quot; text-anchor=&quot;middle&quot; font-size=&quot;14&quot; font-family=&quot;Arial&quot; fill=&quot;#3e5876&quot;&gt;Narrative context&lt;/text&gt;

          &lt;rect x=&quot;702&quot; y=&quot;86&quot; width=&quot;140&quot; height=&quot;72&quot; rx=&quot;14&quot; fill=&quot;#fff7e8&quot; stroke=&quot;#d3a44d&quot; stroke-width=&quot;2&quot;/&gt;
          &lt;text x=&quot;772&quot; y=&quot;116&quot; text-anchor=&quot;middle&quot; font-size=&quot;19&quot; font-family=&quot;Arial&quot; fill=&quot;#8a6313&quot; font-weight=&quot;700&quot;&gt;Reviewer&lt;/text&gt;
          &lt;text x=&quot;772&quot; y=&quot;138&quot; text-anchor=&quot;middle&quot; font-size=&quot;14&quot; font-family=&quot;Arial&quot; fill=&quot;#8a6313&quot;&gt;Interpretation&lt;/text&gt;

          &lt;path d=&quot;M158 122 L186 122&quot; stroke=&quot;#6a8eb8&quot; stroke-width=&quot;4&quot;/&gt;
          &lt;polygon points=&quot;186,122 174,115 174,129&quot; fill=&quot;#6a8eb8&quot;/&gt;

          &lt;path d=&quot;M326 122 L354 122&quot; stroke=&quot;#6a8eb8&quot; stroke-width=&quot;4&quot;/&gt;
          &lt;polygon points=&quot;354,122 342,115 342,129&quot; fill=&quot;#6a8eb8&quot;/&gt;

          &lt;path d=&quot;M506 122 L534 122&quot; stroke=&quot;#6a8eb8&quot; stroke-width=&quot;4&quot;/&gt;
          &lt;polygon points=&quot;534,122 522,115 522,129&quot; fill=&quot;#6a8eb8&quot;/&gt;

          &lt;path d=&quot;M674 122 L702 122&quot; stroke=&quot;#c78b1f&quot; stroke-width=&quot;4&quot;/&gt;
          &lt;polygon points=&quot;702,122 690,115 690,129&quot; fill=&quot;#c78b1f&quot;/&gt;
        &lt;/svg&gt;
      &lt;/div&gt;
    &lt;/div&gt;

    &lt;h2 style=&quot;font-size:28px; color:#153a63; margin:34px 0 12px 0;&quot;&gt;XML snippet, weak vs stronger version&lt;/h2&gt;

    &lt;p&gt;One of the best ways to see the problem is in the XML itself. Here is the same SDTM concept shown two different ways.&lt;/p&gt;

    &lt;!-- XML block weak --&gt;
    &lt;div style=&quot;margin:22px 0 12px 0; font-weight:bold; color:#8f2e2e;&quot;&gt;Weak XML example&lt;/div&gt;
    &lt;div style=&quot;background:#fbfbfc; border:1px solid #d9dee7; border-radius:12px; overflow:hidden; margin-bottom:22px;&quot;&gt;
      &lt;div style=&quot;background:#e9eef5; padding:10px 14px; font-size:13px; color:#4d6078; border-bottom:1px solid #d9dee7; font-family:Arial, Helvetica, sans-serif;&quot;&gt;
        Listing 1. Minimal method description for SDTM LBLOBXFL
      &lt;/div&gt;
      &lt;pre style=&quot;margin:0; padding:18px 18px; overflow:auto; font-size:14px; line-height:1.6; font-family:&#39;Courier New&#39;, monospace; color:#243447;&quot;&gt;&lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;ItemDef&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;OID=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;IT.LB.LBLOBXFL&quot;&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;Name=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;LBLOBXFL&quot;&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;DataType=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;text&quot;&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;Length=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;1&quot;&lt;/span&gt;&lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;gt;&lt;/span&gt;
  &lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;Description&amp;gt;&lt;/span&gt;
    &lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;TranslatedText&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;xml:lang=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;en&quot;&lt;/span&gt;&lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;gt;&lt;/span&gt;Last Observation Before Exposure Flag&lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;/TranslatedText&amp;gt;&lt;/span&gt;
  &lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;/Description&amp;gt;&lt;/span&gt;
  &lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;Origin&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;Type=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;Derived&quot;&lt;/span&gt;&lt;span style=&quot;color:#1a4f8f;&quot;&gt;/&amp;gt;&lt;/span&gt;
  &lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;MethodRef&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;MethodOID=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;MT.LB.LBLOBXFL&quot;&lt;/span&gt;&lt;span style=&quot;color:#1a4f8f;&quot;&gt;/&amp;gt;&lt;/span&gt;
&lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;/ItemDef&amp;gt;&lt;/span&gt;

&lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;MethodDef&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;OID=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;MT.LB.LBLOBXFL&quot;&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;Name=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;Last Observation Before Exposure Flag&quot;&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;Type=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;Computation&quot;&lt;/span&gt;&lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;gt;&lt;/span&gt;
  &lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;Description&amp;gt;&lt;/span&gt;
    &lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;TranslatedText&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;xml:lang=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;en&quot;&lt;/span&gt;&lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;gt;&lt;/span&gt;
      Last non-missing result prior to treatment.
    &lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;/TranslatedText&amp;gt;&lt;/span&gt;
  &lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;/Description&amp;gt;&lt;/span&gt;
&lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;/MethodDef&amp;gt;&lt;/span&gt;&lt;/pre&gt;
    &lt;/div&gt;

    &lt;!-- XML block strong --&gt;
    &lt;div style=&quot;margin:22px 0 12px 0; font-weight:bold; color:#27673a;&quot;&gt;Stronger XML example&lt;/div&gt;
    &lt;div style=&quot;background:#fbfbfc; border:1px solid #d9dee7; border-radius:12px; overflow:hidden; margin-bottom:26px;&quot;&gt;
      &lt;div style=&quot;background:#e9eef5; padding:10px 14px; font-size:13px; color:#4d6078; border-bottom:1px solid #d9dee7; font-family:Arial, Helvetica, sans-serif;&quot;&gt;
        Listing 2. Reviewer-friendly method description for SDTM LBLOBXFL
      &lt;/div&gt;
      &lt;pre style=&quot;margin:0; padding:18px 18px; overflow:auto; font-size:14px; line-height:1.6; font-family:&#39;Courier New&#39;, monospace; color:#243447;&quot;&gt;&lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;ItemDef&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;OID=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;IT.LB.LBLOBXFL&quot;&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;Name=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;LBLOBXFL&quot;&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;DataType=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;text&quot;&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;Length=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;1&quot;&lt;/span&gt;&lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;gt;&lt;/span&gt;
  &lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;Description&amp;gt;&lt;/span&gt;
    &lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;TranslatedText&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;xml:lang=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;en&quot;&lt;/span&gt;&lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;gt;&lt;/span&gt;Last Observation Before Exposure Flag&lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;/TranslatedText&amp;gt;&lt;/span&gt;
  &lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;/Description&amp;gt;&lt;/span&gt;
  &lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;Origin&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;Type=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;Derived&quot;&lt;/span&gt;&lt;span style=&quot;color:#1a4f8f;&quot;&gt;/&amp;gt;&lt;/span&gt;
  &lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;MethodRef&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;MethodOID=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;MT.LB.LBLOBXFL&quot;&lt;/span&gt;&lt;span style=&quot;color:#1a4f8f;&quot;&gt;/&amp;gt;&lt;/span&gt;
&lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;/ItemDef&amp;gt;&lt;/span&gt;

&lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;MethodDef&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;OID=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;MT.LB.LBLOBXFL&quot;&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;Name=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;Last Observation Before Exposure Flag Derivation&quot;&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;Type=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;Computation&quot;&lt;/span&gt;&lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;gt;&lt;/span&gt;
  &lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;Description&amp;gt;&lt;/span&gt;
    &lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;TranslatedText&lt;/span&gt; &lt;span style=&quot;color:#7a3e9d;&quot;&gt;xml:lang=&lt;/span&gt;&lt;span style=&quot;color:#b14d00;&quot;&gt;&quot;en&quot;&lt;/span&gt;&lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;gt;&lt;/span&gt;
      LBLOBXFL is assigned as &#39;Y&#39; to the chronologically latest non-missing
      result collected before first exposure. If only dates are available,
      collection date must be strictly earlier than DM.RFSTDTC. Records on
      the first-dose date are eligible only when both collection time and
      dosing time are available and the collection occurs before dosing.
      Records with missing time on the first-dose date are not eligible.
      If multiple qualifying records exist, the latest chronological record
      is selected.
    &lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;/TranslatedText&amp;gt;&lt;/span&gt;
  &lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;/Description&amp;gt;&lt;/span&gt;
&lt;span style=&quot;color:#1a4f8f;&quot;&gt;&amp;lt;/MethodDef&amp;gt;&lt;/span&gt;&lt;/pre&gt;
    &lt;/div&gt;

    &lt;h2 style=&quot;font-size:28px; color:#153a63; margin:34px 0 12px 0;&quot;&gt;Where this usually breaks&lt;/h2&gt;

    &lt;p&gt;From experience, these are the places where weak metadata triggers the most review friction:&lt;/p&gt;

    &lt;table style=&quot;width:100%; border-collapse:collapse; margin:20px 0 30px 0; font-size:15px;&quot;&gt;
      &lt;thead&gt;
        &lt;tr&gt;
          &lt;th style=&quot;text-align:left; padding:12px; background:#153a63; color:#ffffff; border:1px solid #d7e1ee;&quot;&gt;Area&lt;/th&gt;
          &lt;th style=&quot;text-align:left; padding:12px; background:#153a63; color:#ffffff; border:1px solid #d7e1ee;&quot;&gt;Common weak wording&lt;/th&gt;
          &lt;th style=&quot;text-align:left; padding:12px; background:#153a63; color:#ffffff; border:1px solid #d7e1ee;&quot;&gt;What is missing&lt;/th&gt;
        &lt;/tr&gt;
      &lt;/thead&gt;
      &lt;tbody&gt;
        &lt;tr&gt;
          &lt;td style=&quot;padding:12px; border:1px solid #d7e1ee;&quot;&gt;&lt;strong&gt;Study Day (--DY)&lt;/strong&gt;&lt;/td&gt;
          &lt;td style=&quot;padding:12px; border:1px solid #d7e1ee;&quot;&gt;Derived from reference start date&lt;/td&gt;
          &lt;td style=&quot;padding:12px; border:1px solid #d7e1ee;&quot;&gt;Formula, sign convention, partial date handling&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr style=&quot;background:#f8fbff;&quot;&gt;
          &lt;td style=&quot;padding:12px; border:1px solid #d7e1ee;&quot;&gt;&lt;strong&gt;Partial dates&lt;/strong&gt;&lt;/td&gt;
          &lt;td style=&quot;padding:12px; border:1px solid #d7e1ee;&quot;&gt;Partial dates were imputed&lt;/td&gt;
          &lt;td style=&quot;padding:12px; border:1px solid #d7e1ee;&quot;&gt;Method, scope, and where the imputed value is used&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td style=&quot;padding:12px; border:1px solid #d7e1ee;&quot;&gt;&lt;strong&gt;Lab standardization&lt;/strong&gt;&lt;/td&gt;
          &lt;td style=&quot;padding:12px; border:1px solid #d7e1ee;&quot;&gt;Standard unit&lt;/td&gt;
          &lt;td style=&quot;padding:12px; border:1px solid #d7e1ee;&quot;&gt;Conversion rule, order of operations, flag impact&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr style=&quot;background:#f8fbff;&quot;&gt;
          &lt;td style=&quot;padding:12px; border:1px solid #d7e1ee;&quot;&gt;&lt;strong&gt;Cross-domain rules&lt;/strong&gt;&lt;/td&gt;
          &lt;td style=&quot;padding:12px; border:1px solid #d7e1ee;&quot;&gt;Separate domain notes only&lt;/td&gt;
          &lt;td style=&quot;padding:12px; border:1px solid #d7e1ee;&quot;&gt;Whether the same concept behaves consistently across domains&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td style=&quot;padding:12px; border:1px solid #d7e1ee;&quot;&gt;&lt;strong&gt;Traceability&lt;/strong&gt;&lt;/td&gt;
          &lt;td style=&quot;padding:12px; border:1px solid #d7e1ee;&quot;&gt;Relationship to study drug&lt;/td&gt;
          &lt;td style=&quot;padding:12px; border:1px solid #d7e1ee;&quot;&gt;Collected vs assigned vs sponsor-derived logic&lt;/td&gt;
        &lt;/tr&gt;
      &lt;/tbody&gt;
    &lt;/table&gt;

    &lt;!-- Visual 4 --&gt;
    &lt;div style=&quot;margin:34px 0; border:1px solid #d7e1ee; border-radius:16px; overflow:hidden; background:#fbfdff;&quot;&gt;
      &lt;div style=&quot;padding:16px 20px 8px 20px; font-size:22px; font-weight:bold; color:#153a63;&quot;&gt;Figure 4. Define.xml review checklist&lt;/div&gt;
      &lt;div style=&quot;padding:0 20px 20px 20px; color:#5b6878; font-size:14px;&quot;&gt;A simple internal test before final package release.&lt;/div&gt;
      &lt;div style=&quot;padding:8px 14px 24px 14px;&quot;&gt;
        &lt;svg viewBox=&quot;0 0 860 340&quot; width=&quot;100%&quot; xmlns=&quot;http://www.w3.org/2000/svg&quot; role=&quot;img&quot; aria-label=&quot;Checklist flow for define.xml review&quot;&gt;
          &lt;rect x=&quot;322&quot; y=&quot;18&quot; width=&quot;216&quot; height=&quot;48&quot; rx=&quot;12&quot; fill=&quot;#2b68b0&quot;/&gt;
          &lt;text x=&quot;430&quot; y=&quot;49&quot; text-anchor=&quot;middle&quot; font-size=&quot;21&quot; font-family=&quot;Arial&quot; fill=&quot;#ffffff&quot; font-weight=&quot;700&quot;&gt;define.xml review&lt;/text&gt;

          &lt;rect x=&quot;46&quot; y=&quot;106&quot; width=&quot;166&quot; height=&quot;64&quot; rx=&quot;14&quot; fill=&quot;#eef4fb&quot; stroke=&quot;#90b3da&quot; stroke-width=&quot;2&quot;/&gt;
          &lt;text x=&quot;129&quot; y=&quot;132&quot; text-anchor=&quot;middle&quot; font-size=&quot;18&quot; font-family=&quot;Arial&quot; fill=&quot;#153a63&quot; font-weight=&quot;700&quot;&gt;Reproducibility&lt;/text&gt;
          &lt;text x=&quot;129&quot; y=&quot;154&quot; text-anchor=&quot;middle&quot; font-size=&quot;14&quot; font-family=&quot;Arial&quot; fill=&quot;#45617f&quot;&gt;Can it be rebuilt?&lt;/text&gt;

          &lt;rect x=&quot;240&quot; y=&quot;106&quot; width=&quot;166&quot; height=&quot;64&quot; rx=&quot;14&quot; fill=&quot;#eef4fb&quot; stroke=&quot;#90b3da&quot; stroke-width=&quot;2&quot;/&gt;
          &lt;text x=&quot;323&quot; y=&quot;132&quot; text-anchor=&quot;middle&quot; font-size=&quot;18&quot; font-family=&quot;Arial&quot; fill=&quot;#153a63&quot; font-weight=&quot;700&quot;&gt;Ambiguity&lt;/text&gt;
          &lt;text x=&quot;323&quot; y=&quot;154&quot; text-anchor=&quot;middle&quot; font-size=&quot;14&quot; font-family=&quot;Arial&quot; fill=&quot;#45617f&quot;&gt;More than one meaning?&lt;/text&gt;

          &lt;rect x=&quot;434&quot; y=&quot;106&quot; width=&quot;166&quot; height=&quot;64&quot; rx=&quot;14&quot; fill=&quot;#eef4fb&quot; stroke=&quot;#90b3da&quot; stroke-width=&quot;2&quot;/&gt;
          &lt;text x=&quot;517&quot; y=&quot;132&quot; text-anchor=&quot;middle&quot; font-size=&quot;18&quot; font-family=&quot;Arial&quot; fill=&quot;#153a63&quot; font-weight=&quot;700&quot;&gt;Boundaries&lt;/text&gt;
          &lt;text x=&quot;517&quot; y=&quot;154&quot; text-anchor=&quot;middle&quot; font-size=&quot;14&quot; font-family=&quot;Arial&quot; fill=&quot;#45617f&quot;&gt;Edge cases defined?&lt;/text&gt;

          &lt;rect x=&quot;628&quot; y=&quot;106&quot; width=&quot;166&quot; height=&quot;64&quot; rx=&quot;14&quot; fill=&quot;#eef4fb&quot; stroke=&quot;#90b3da&quot; stroke-width=&quot;2&quot;/&gt;
          &lt;text x=&quot;711&quot; y=&quot;132&quot; text-anchor=&quot;middle&quot; font-size=&quot;18&quot; font-family=&quot;Arial&quot; fill=&quot;#153a63&quot; font-weight=&quot;700&quot;&gt;Consistency&lt;/text&gt;
          &lt;text x=&quot;711&quot; y=&quot;154&quot; text-anchor=&quot;middle&quot; font-size=&quot;14&quot; font-family=&quot;Arial&quot; fill=&quot;#45617f&quot;&gt;Across domains?&lt;/text&gt;

          &lt;rect x=&quot;322&quot; y=&quot;236&quot; width=&quot;216&quot; height=&quot;68&quot; rx=&quot;14&quot; fill=&quot;#eef8ef&quot; stroke=&quot;#86b388&quot; stroke-width=&quot;2&quot;/&gt;
          &lt;text x=&quot;430&quot; y=&quot;262&quot; text-anchor=&quot;middle&quot; font-size=&quot;18&quot; font-family=&quot;Arial&quot; fill=&quot;#27673a&quot; font-weight=&quot;700&quot;&gt;Traceability&lt;/text&gt;
          &lt;text x=&quot;430&quot; y=&quot;286&quot; text-anchor=&quot;middle&quot; font-size=&quot;14&quot; font-family=&quot;Arial&quot; fill=&quot;#3d633f&quot;&gt;Can the reviewer follow CRF → SDTM → derived rule?&lt;/text&gt;

          &lt;path d=&quot;M430 66 L129 106&quot; stroke=&quot;#7d9cc1&quot; stroke-width=&quot;3&quot;/&gt;
          &lt;path d=&quot;M430 66 L323 106&quot; stroke=&quot;#7d9cc1&quot; stroke-width=&quot;3&quot;/&gt;
          &lt;path d=&quot;M430 66 L517 106&quot; stroke=&quot;#7d9cc1&quot; stroke-width=&quot;3&quot;/&gt;
          &lt;path d=&quot;M430 66 L711 106&quot; stroke=&quot;#7d9cc1&quot; stroke-width=&quot;3&quot;/&gt;
          &lt;path d=&quot;M430 170 L430 236&quot; stroke=&quot;#8cab62&quot; stroke-width=&quot;3&quot;/&gt;

          &lt;text x=&quot;430&quot; y=&quot;328&quot; text-anchor=&quot;middle&quot; font-size=&quot;16&quot; font-family=&quot;Arial&quot; fill=&quot;#8f2e2e&quot; font-weight=&quot;700&quot;&gt;If any answer is “no”, expect reviewer questions.&lt;/text&gt;
        &lt;/svg&gt;
      &lt;/div&gt;
    &lt;/div&gt;

    &lt;h2 style=&quot;font-size:28px; color:#153a63; margin:34px 0 12px 0;&quot;&gt;A simple review checklist before submission&lt;/h2&gt;

    &lt;ol style=&quot;padding-left:24px;&quot;&gt;
      &lt;li&gt;&lt;strong&gt;Reproducibility&lt;/strong&gt;, can an experienced programmer recreate the variable using only define.xml?&lt;/li&gt;
      &lt;li&gt;&lt;strong&gt;Ambiguity&lt;/strong&gt;, does the description allow more than one reasonable interpretation?&lt;/li&gt;
      &lt;li&gt;&lt;strong&gt;Boundary handling&lt;/strong&gt;, are same-day, missing-time, partial-date, repeated-record, and tie cases clearly defined?&lt;/li&gt;
      &lt;li&gt;&lt;strong&gt;Consistency&lt;/strong&gt;, is the same concept handled the same way across domains unless an exception is explicitly stated?&lt;/li&gt;
      &lt;li&gt;&lt;strong&gt;Traceability&lt;/strong&gt;, can a reviewer move from CRF to SDTM to derived variable without guessing?&lt;/li&gt;
    &lt;/ol&gt;

    &lt;p&gt;If any answer is no, the package may still validate cleanly, but it is not fully review-ready.&lt;/p&gt;

    &lt;h2 style=&quot;font-size:28px; color:#153a63; margin:34px 0 12px 0;&quot;&gt;Final thought&lt;/h2&gt;

    &lt;p&gt;Passing technical validation is necessary.&lt;/p&gt;
    &lt;p&gt;It is not sufficient.&lt;/p&gt;
    &lt;p&gt;Define.xml is not just a supporting file. For many reviewers, it is the first real interface to your SDTM data.&lt;/p&gt;
    &lt;p&gt;If they had only this file, would they understand your submission, or question it?&lt;/p&gt;

    &lt;div style=&quot;margin-top:36px; padding:18px 20px; border-radius:14px; background:#153a63; color:#ffffff;&quot;&gt;
      &lt;strong&gt;Suggested closing question for comments&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
      Have you seen define.xml wording that looked fine internally, but triggered avoidable review questions later?
    &lt;/div&gt;

  &lt;/div&gt;
&lt;/body&gt;
&lt;/html&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2315822260943695633/posts/default/2220893745947486488'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2315822260943695633/posts/default/2220893745947486488'/><link rel='alternate' type='text/html' href='http://studysas.blogspot.com/2026/03/your-sdtm-passed-validation-that-doesnt.html' title='Your SDTM Passed Validation. That Doesn’t Mean You’re Safe'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-2315822260943695633.post-4225413153328309952</id><published>2025-01-23T08:30:00.003-05:00</published><updated>2025-01-23T08:33:48.390-05:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="cSDRG study design section"/><category scheme="http://www.blogger.com/atom/ns#" term="protocol to cSDRG tense changes"/><category scheme="http://www.blogger.com/atom/ns#" term="study design documentation"/><title type='text'>From Protocol to cSDRG: how to write cSDRG study design section</title><content type='html'>
&lt;!DOCTYPE html&gt;
&lt;html lang=&quot;en&quot;&gt;
&lt;head&gt;
    &lt;meta charset=&quot;UTF-8&quot;&gt;
    &lt;meta name=&quot;viewport&quot; content=&quot;width=device-width, initial-scale=1.0&quot;&gt;
    &lt;title&gt;&lt;/title&gt;
    &lt;style&gt;
        body {
            font-family: &#39;Segoe UI&#39;, Tahoma, Geneva, Verdana, sans-serif;
            line-height: 1.6;
            max-width: 800px;
            margin: 0 auto;
            padding: 20px;
            color: #333;
        }
        h1 {
            color: #2c3e50;
            border-bottom: 2px solid #3498db;
            padding-bottom: 10px;
        }
        h2 {
            color: #2980b9;
            margin-top: 30px;
        }
        .example-box {
            background-color: #f8f9fa;
            border-left: 4px solid #3498db;
            padding: 15px;
            margin: 20px 0;
        }
        .pro-tip {
            background-color: #e8f4f8;
            border-radius: 5px;
            padding: 15px;
            margin: 20px 0;
        }
        .comparison-table {
            width: 100%;
            border-collapse: collapse;
            margin: 20px 0;
        }
        .comparison-table th, .comparison-table td {
            border: 1px solid #ddd;
            padding: 12px;
            text-align: left;
        }
        .comparison-table th {
            background-color: #f4f4f4;
        }
    &lt;/style&gt;
&lt;/head&gt;
&lt;body&gt;
  
    &lt;h1&gt;&lt;/h1&gt;
    
    &lt;p&gt;As clinical research professionals, we often grapple with a unique challenge: transforming our forward-looking protocol documents into retrospective study documentation. One particular area where this becomes crucial is in the Clinical Study Data Reviewer&#39;s Guide (cSDRG), especially when documenting the Study Design section. Today, we&#39;ll explore why simply copying and pasting from your protocol isn&#39;t the best approach, and how to effectively translate your study design documentation into its proper historical context.&lt;/p&gt;

    &lt;h2&gt;Why Time Matters in Clinical Documentation&lt;/h2&gt;
    
    &lt;p&gt;The protocol and cSDRG serve fundamentally different purposes in the clinical research narrative. Your protocol is your roadmap - it outlines what you plan to do. The cSDRG, on the other hand, tells the story of what actually happened. This distinction is crucial for regulatory reviewers who need to understand how your study unfolded in reality.&lt;/p&gt;

    &lt;div class=&quot;pro-tip&quot;&gt;
        &lt;strong&gt;Pro Tip:&lt;/strong&gt; Think of your protocol as a travel itinerary and your cSDRG as a travel journal. While your itinerary shows where you planned to go, your journal documents where you actually went and what really happened along the way.
    &lt;/div&gt;

    &lt;h2&gt;The Art of Translation: From Future to Past&lt;/h2&gt;

    &lt;p&gt;Converting your study design documentation isn&#39;t just about changing verb tenses - it&#39;s about capturing the reality of your study execution. Here&#39;s how to approach this transformation effectively:&lt;/p&gt;

    &lt;div class=&quot;example-box&quot;&gt;
        &lt;h3&gt;Example Transformations&lt;/h3&gt;
        &lt;table class=&quot;comparison-table&quot;&gt;
            &lt;thead&gt;
                &lt;tr&gt;
                    &lt;th&gt;Protocol (Future Tense)&lt;/th&gt;
                    &lt;th&gt;cSDRG (Past Tense)&lt;/th&gt;
                &lt;/tr&gt;
            &lt;/thead&gt;
            &lt;tbody&gt;
                &lt;tr&gt;
                    &lt;td&gt;&quot;This will be a randomized, double-blind study that will enroll 100 subjects...&quot;&lt;/td&gt;
                    &lt;td&gt;&quot;This was a randomized, double-blind study that enrolled 98 subjects...&quot;&lt;/td&gt;
                &lt;/tr&gt;
                &lt;tr&gt;
                    &lt;td&gt;&quot;Subjects will receive treatment for 12 weeks...&quot;&lt;/td&gt;
                    &lt;td&gt;&quot;Subjects received treatment for 12 weeks, with a mean treatment duration of 11.2 weeks...&quot;&lt;/td&gt;
                &lt;/tr&gt;
                &lt;tr&gt;
                    &lt;td&gt;&quot;Blood samples will be collected at baseline and weeks 4, 8, and 12.&quot;&lt;/td&gt;
                    &lt;td&gt;&quot;Blood samples were collected at baseline and weeks 4, 8, and 12, with additional unscheduled sampling as needed.&quot;&lt;/td&gt;
                &lt;/tr&gt;
            &lt;/tbody&gt;
        &lt;/table&gt;
    &lt;/div&gt;

    &lt;h2&gt;Best Practices for Study Design Documentation&lt;/h2&gt;

    &lt;p&gt;When preparing your cSDRG Study Design section, consider these key principles:&lt;/p&gt;

    &lt;div class=&quot;example-box&quot;&gt;
        &lt;ol&gt;
            &lt;li&gt;&lt;strong&gt;Use Actual Numbers:&lt;/strong&gt; Replace planned enrollment figures with actual participant counts.&lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;Document Deviations:&lt;/strong&gt; Include any significant departures from the planned design that occurred during study execution.&lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;Maintain Consistency:&lt;/strong&gt; Ensure all descriptions align with the past-tense reporting style used throughout the cSDRG.&lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;Add Context:&lt;/strong&gt; Include relevant details about how procedures were actually implemented.&lt;/li&gt;
        &lt;/ol&gt;
    &lt;/div&gt;

    &lt;h2&gt;The Impact on SDTM Integration&lt;/h2&gt;

    &lt;p&gt;This careful translation of study design documentation becomes particularly important when working with Study Data Tabulation Model (SDTM) datasets. Your cSDRG serves as a bridge between your protocol&#39;s intentions and your SDTM data&#39;s reality, helping reviewers understand any discrepancies or special circumstances that emerged during the study.&lt;/p&gt;

    &lt;div class=&quot;pro-tip&quot;&gt;
        &lt;strong&gt;Remember:&lt;/strong&gt; Your cSDRG should tell the story of your study as it actually happened, providing context for your SDTM datasets and helping reviewers understand your data in its proper context.
    &lt;/div&gt;

    &lt;h2&gt;Conclusion&lt;/h2&gt;

    &lt;p&gt;Creating an effective cSDRG Study Design section requires more than simple copy-and-paste operations from your protocol. By thoughtfully translating your study design documentation into past tense and incorporating actual study outcomes, you create a more valuable resource for regulatory reviewers and maintain the integrity of your clinical study documentation.&lt;/p&gt;

&lt;/body&gt;
&lt;/html&gt;

</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2315822260943695633/posts/default/4225413153328309952'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2315822260943695633/posts/default/4225413153328309952'/><link rel='alternate' type='text/html' href='http://studysas.blogspot.com/2025/01/from-protocol-to-csdrg-art-of-time.html' title='From Protocol to cSDRG: how to write cSDRG study design section'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-2315822260943695633.post-8682127494935716661</id><published>2025-01-15T15:27:00.001-05:00</published><updated>2025-01-15T15:28:11.546-05:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="Dataset Structure and Keys Column in Define.xml"/><category scheme="http://www.blogger.com/atom/ns#" term="Dataset Structure Documentation in Define.xml"/><title type='text'>The Critical Importance of Dataset Structure Documentation in Define.xml: A Senior SDTM Programmer&#39;s Perspective</title><content type='html'>&lt;!DOCTYPE html&gt;
&lt;html lang=&quot;en&quot;&gt;
&lt;head&gt;
    &lt;meta charset=&quot;UTF-8&quot;&gt;
    &lt;meta name=&quot;viewport&quot; content=&quot;width=device-width, initial-scale=1.0&quot;&gt;
    &lt;title&gt;SDTM Dataset Structure Documentation: A Senior Programmer&#39;s Perspective&lt;/title&gt;
    &lt;style&gt;
        body {
            font-family: Arial, sans-serif;
            line-height: 1.6;
            max-width: 1200px;
            margin: 0 auto;
            padding: 20px;
            color: #333;
        }
        .header {
            background-color: #f5f5f5;
            padding: 20px;
            border-radius: 5px;
            margin-bottom: 30px;
        }
        .author-info {
            font-style: italic;
            color: #666;
            margin-bottom: 20px;
        }
        .code-block {
            background-color: #f8f8f8;
            padding: 15px;
            border-left: 4px solid #2196F3;
            margin: 20px 0;
            overflow-x: auto;
        }
        .highlight {
            background-color: #fff3cd;
            padding: 2px 5px;
            border-radius: 3px;
        }
        .tip {
            background-color: #e3f2fd;
            padding: 15px;
            border-radius: 5px;
            margin: 20px 0;
        }
        h2 {
            color: #1976D2;
            margin-top: 30px;
        }
        h3 {
            color: #2196F3;
        }
    &lt;/style&gt;
&lt;/head&gt;
&lt;body&gt;

    &lt;section&gt;
        &lt;h2&gt;Introduction: Why I&#39;m Writing This&lt;/h2&gt;
        &lt;p&gt;
            After spending over 15 years mapping clinical data to SDTM, I&#39;ve seen firsthand how proper dataset structure documentation can make or break a submission. Recently, I encountered a situation where incomplete structure descriptions in Define.xml led to significant rework in a late-phase study. This experience prompted me to share my insights on why meticulous documentation of dataset structures is crucial.
        &lt;/p&gt;
    &lt;/section&gt;

    &lt;section&gt;
        &lt;h2&gt;The Real-World Impact of Structure Documentation&lt;/h2&gt;
        &lt;p&gt;
            Let me share a recent example from my work. We inherited a study where the LB domain structure was documented simply as:
        &lt;/p&gt;
        &lt;div class=&quot;code-block&quot;&gt;
            &quot;One record per analyte per planned time point per visit per subject&quot;
        &lt;/div&gt;
        &lt;p&gt;
            However, the key variables included:
        &lt;/p&gt;
        &lt;div class=&quot;code-block&quot;&gt;
            STUDYID, USUBJID, LBREFID, LBCAT, LBSCAT, LBTESTCD, LBMETHOD, VISITNUM, LBSTAT, LBORRES, LBDTC
        &lt;/div&gt;
        &lt;p&gt;
            This mismatch led to several issues:
        &lt;/p&gt;
        &lt;ul&gt;
            &lt;li&gt;Data mapping programs didn&#39;t account for method variations (LBMETHOD)&lt;/li&gt;
            &lt;li&gt;Validation checks missed status-dependent conditions (LBSTAT)&lt;/li&gt;
            &lt;li&gt;Analysis datasets required rework due to unexpected categorical groupings (LBCAT, LBSCAT)&lt;/li&gt;
        &lt;/ul&gt;
    &lt;/section&gt;

    &lt;section&gt;
        &lt;h2&gt;Programming Implications&lt;/h2&gt;
        &lt;div class=&quot;tip&quot;&gt;
            &lt;strong&gt;Pro Tip:&lt;/strong&gt; Always write your SDTM specification review findings in a way that allows for quick implementation of corrections.
        &lt;/div&gt;
        &lt;p&gt;
            From a programming perspective, comprehensive structure descriptions help us:
        &lt;/p&gt;
        &lt;ul&gt;
            &lt;li&gt;Write more efficient data mapping code by understanding all required keys&lt;/li&gt;
            &lt;li&gt;Implement proper sort orders based on the full record uniqueness&lt;/li&gt;
            &lt;li&gt;Create more robust validation checks&lt;/li&gt;
            &lt;li&gt;Design better performance optimization strategies&lt;/li&gt;
        &lt;/ul&gt;
    &lt;/section&gt;

    &lt;section&gt;
        &lt;h2&gt;Common Structural Documentation Issues I&#39;ve Encountered&lt;/h2&gt;
        &lt;h3&gt;1. The FA (Findings About) Domain Challenge&lt;/h3&gt;
        &lt;p&gt;
            A classic example is the FA domain, where I often see this structure:
        &lt;/p&gt;
        &lt;div class=&quot;code-block&quot;&gt;
            Original: &quot;One record per finding per object per time point per visit per subject&quot;
        &lt;/div&gt;
        &lt;p&gt;
            What it should be:
        &lt;/p&gt;
        &lt;div class=&quot;code-block&quot;&gt;
            Improved: &quot;One record per finding per object per grouped observation (FAGRPID), including categorization (FACAT) and method (FAMETHOD), per time point per visit per subject&quot;
        &lt;/div&gt;
    &lt;/section&gt;

    &lt;section&gt;
        &lt;h2&gt;Practical Solutions I&#39;ve Implemented&lt;/h2&gt;
        &lt;p&gt;
            Over the years, I&#39;ve developed these practices for better structure documentation:
        &lt;/p&gt;
        &lt;ol&gt;
            &lt;li&gt;
                &lt;strong&gt;Automated Comparison Tool:&lt;/strong&gt; I&#39;ve created a SAS macro that compares Define.xml structure descriptions against actual key variables used in the datasets.
            &lt;/li&gt;
            &lt;li&gt;
                &lt;strong&gt;Structure Template Library:&lt;/strong&gt; Maintaining a repository of comprehensive structure descriptions for common scenarios.
            &lt;/li&gt;
            &lt;li&gt;
                &lt;strong&gt;Review Checklist:&lt;/strong&gt; A systematic approach to verify structure completeness.
            &lt;/li&gt;
        &lt;/ol&gt;
    &lt;/section&gt;

    &lt;section&gt;
        &lt;h2&gt;Impact on Study Timeline and Resources&lt;/h2&gt;
        &lt;p&gt;
            In my experience managing SDTM conversions, proper structure documentation can:
        &lt;/p&gt;
        &lt;ul&gt;
            &lt;li&gt;Reduce mapping programming time by ~25%&lt;/li&gt;
            &lt;li&gt;Cut validation issues by up to 40%&lt;/li&gt;
            &lt;li&gt;Minimize rework during QC and analysis dataset creation&lt;/li&gt;
        &lt;/ul&gt;
    &lt;/section&gt;

    &lt;section&gt;
        &lt;h2&gt;Recommendations for Fellow SDTM Programmers&lt;/h2&gt;
        &lt;div class=&quot;tip&quot;&gt;
            &lt;strong&gt;Key Practice:&lt;/strong&gt; Always validate your structure descriptions against both the SDTM Implementation Guide and your actual data.
        &lt;/div&gt;
        &lt;p&gt;
            Based on my experience, here are crucial steps:
        &lt;/p&gt;
        &lt;ol&gt;
            &lt;li&gt;Review structure descriptions during specification development&lt;/li&gt;
            &lt;li&gt;Cross-reference with SDTM IG examples&lt;/li&gt;
            &lt;li&gt;Validate against actual data patterns&lt;/li&gt;
            &lt;li&gt;Document any special cases or exceptions&lt;/li&gt;
        &lt;/ol&gt;
    &lt;/section&gt;

    &lt;section&gt;
        &lt;h2&gt;Conclusion: A Call to Action&lt;/h2&gt;
        &lt;p&gt;
            As senior SDTM programmers, it&#39;s our responsibility to ensure that our Define.xml documentation serves its purpose effectively. Proper structure documentation isn&#39;t just about compliance – it&#39;s about creating efficient, maintainable, and high-quality clinical data submissions.
        &lt;/p&gt;
        &lt;p&gt;
            Remember: The time invested in proper documentation pays dividends throughout the study lifecycle and across future studies.
        &lt;/p&gt;
    &lt;/section&gt;

    &lt;footer style=&quot;margin-top: 40px; padding-top: 20px; border-top: 1px solid #ddd; text-align: center;&quot;&gt;
        &lt;p&gt;
            Share your experiences or reach out for additional insights on SDTM implementation best practices.
        &lt;/p&gt;
    &lt;/footer&gt;
&lt;/body&gt;
&lt;/html&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2315822260943695633/posts/default/8682127494935716661'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2315822260943695633/posts/default/8682127494935716661'/><link rel='alternate' type='text/html' href='http://studysas.blogspot.com/2025/01/the-critical-importance-of-dataset.html' title='The Critical Importance of Dataset Structure Documentation in Define.xml: A Senior SDTM Programmer&#39;s Perspective'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-2315822260943695633.post-750762088325638314</id><published>2025-01-15T14:31:00.002-05:00</published><updated>2025-01-15T14:31:39.913-05:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="Comments Tab in Define.xml"/><title type='text'>Mastering the Art of Comments in Define.xml: Your Ultimate Guide to Clinical Data Documentation</title><content type='html'>&lt;!DOCTYPE html&gt;
&lt;html lang=&quot;en&quot;&gt;
&lt;head&gt;
    &lt;meta charset=&quot;UTF-8&quot;&gt;
    &lt;meta name=&quot;viewport&quot; content=&quot;width=device-width, initial-scale=1.0&quot;&gt;
    &lt;title&gt;Mastering the Art of Comments in Define.xml: Your Ultimate Guide to Clinical Data Documentation&lt;/title&gt;
    &lt;style&gt;
        body {
            font-family: Arial, sans-serif;
            line-height: 1.6;
            max-width: 1200px;
            margin: 0 auto;
            padding: 20px;
            color: #333;
        }
        h1 {
            color: #2c3e50;
            border-bottom: 2px solid #3498db;
            padding-bottom: 10px;
        }
        h2 {
            color: #34495e;
            margin-top: 30px;
        }
        .highlight {
            background-color: #f8f9fa;
            padding: 20px;
            border-left: 4px solid #3498db;
            margin: 20px 0;
        }
        .author-info {
            font-style: italic;
            color: #666;
            margin: 20px 0;
        }
        .example {
            background-color: #f1f1f1;
            padding: 15px;
            border-radius: 5px;
            margin: 10px 0;
        }
    &lt;/style&gt;
&lt;/head&gt;
&lt;body&gt;
    &lt;h1&gt;Mastering the Art of Comments in Define.xml: Your Ultimate Guide to Clinical Data Documentation&lt;/h1&gt;
    
    &lt;div class=&quot;author-info&quot;&gt;
        Posted by Sarath&lt;br&gt;
    &lt;/div&gt;

    &lt;p&gt;
        In the world of clinical data management, the define.xml file serves as the cornerstone of dataset documentation. 
        While most professionals focus on the basic structural elements, the Comments tab often remains an underutilized 
        goldmine of information. Today, we&#39;ll dive deep into how to leverage this powerful feature to enhance your 
        clinical data documentation.
    &lt;/p&gt;

    &lt;div class=&quot;highlight&quot;&gt;
        &lt;p&gt;&lt;strong&gt;Quick Takeaway:&lt;/strong&gt; Well-crafted comments in your define.xml can significantly reduce queries 
        during regulatory submissions and streamline the review process.&lt;/p&gt;
    &lt;/div&gt;

    &lt;h2&gt;Why Comments Matter in Define.xml&lt;/h2&gt;
    &lt;p&gt;
        The Comments tab isn&#39;t just an afterthought - it&#39;s your opportunity to provide crucial context that doesn&#39;t fit 
        neatly into other standardized fields. Think of it as your chance to tell the complete story behind your data.
    &lt;/p&gt;

    &lt;h2&gt;8 Essential Comment Categories You Can&#39;t Ignore&lt;/h2&gt;
    
    &lt;h3&gt;1. Clarification on Derived Variables&lt;/h3&gt;
    &lt;div class=&quot;example&quot;&gt;
        Example: &quot;AGE is derived based on the difference between RFSTDTC (Reference Start Date) and BRTHDTC (Birth Date), divided by 365.25.&quot;
    &lt;/div&gt;

    &lt;h3&gt;2. Handling of Missing Data&lt;/h3&gt;
    &lt;div class=&quot;example&quot;&gt;
        Example: &quot;VISITNUM for unscheduled visits is assigned based on the next available scheduled visit number plus 0.1.&quot;
    &lt;/div&gt;

    &lt;h3&gt;3. Custom Controlled Terminology&lt;/h3&gt;
    &lt;div class=&quot;example&quot;&gt;
        Example: &quot;LBTEST includes custom terms for additional lab tests specific to this study, such as &#39;NLRATIO&#39; (Neutrophil-to-Lymphocyte Ratio).&quot;
    &lt;/div&gt;

    &lt;h3&gt;4. Explanation of Anomalies or Outliers&lt;/h3&gt;
    &lt;div class=&quot;example&quot;&gt;
        Example: &quot;Heart rate values exceeding 200 bpm were confirmed with the investigator as accurate measurements during exercise testing.&quot;
    &lt;/div&gt;

    &lt;h3&gt;5. Mapping Decisions&lt;/h3&gt;
    &lt;div class=&quot;example&quot;&gt;
        Example: &quot;AEDECOD was mapped to MedDRA v25.0 using coding guidelines. Non-mappable terms were assigned as &#39;OTHER.&#39;&quot;
    &lt;/div&gt;

    &lt;h3&gt;6. Complex or Study-Specific Rules&lt;/h3&gt;
    &lt;div class=&quot;example&quot;&gt;
        Example: &quot;DTHFL is populated as &#39;Y&#39; if the death date is before the cutoff date; otherwise, it is null.&quot;
    &lt;/div&gt;

    &lt;h3&gt;7. Reference to External Data&lt;/h3&gt;
    &lt;div class=&quot;example&quot;&gt;
        Example: &quot;EXTRT values were sourced from the sponsor&#39;s drug dictionary version 2.0.&quot;
    &lt;/div&gt;

    &lt;h3&gt;8. Additional Guidance for Reviewers&lt;/h3&gt;
    &lt;div class=&quot;example&quot;&gt;
        Example: &quot;This dataset contains results from the EQ-5D questionnaire. Higher scores indicate better quality of life.&quot;
    &lt;/div&gt;

    &lt;h2&gt;Best Practices for Writing Effective Comments&lt;/h2&gt;
    &lt;div class=&quot;highlight&quot;&gt;
        &lt;p&gt;Follow these essential guidelines to create clear and useful comments:&lt;/p&gt;
    &lt;/div&gt;
    &lt;ul&gt;
        &lt;li&gt;&lt;strong&gt;Be Concise:&lt;/strong&gt; Avoid overly lengthy comments; stick to clear, precise descriptions that convey the necessary information without unnecessary words.&lt;/li&gt;
        &lt;li&gt;&lt;strong&gt;Use Plain Language:&lt;/strong&gt; Ensure your comments are understandable by both technical and non-technical audiences. Avoid jargon unless absolutely necessary.&lt;/li&gt;
        &lt;li&gt;&lt;strong&gt;Provide Context:&lt;/strong&gt; Always relate the comment directly to the variable or dataset it explains. Make connections clear and explicit.&lt;/li&gt;
        &lt;li&gt;&lt;strong&gt;Standardize Format:&lt;/strong&gt; Use consistent formatting throughout your documentation for ease of review and better readability.&lt;/li&gt;
        &lt;li&gt;&lt;strong&gt;Include Examples:&lt;/strong&gt; Where appropriate, provide concrete examples to illustrate complex concepts or rules.&lt;/li&gt;
        &lt;li&gt;&lt;strong&gt;Reference Sources:&lt;/strong&gt; When referring to external standards or documents, clearly cite the version and source.&lt;/li&gt;
    &lt;/ul&gt;

    &lt;h2&gt;Common Pitfalls to Avoid&lt;/h2&gt;
    &lt;p&gt;
        While documenting in define.xml, avoid these common mistakes:
    &lt;/p&gt;
    &lt;ul&gt;
        &lt;li&gt;Vague or ambiguous explanations&lt;/li&gt;
        &lt;li&gt;Inconsistent terminology across domains&lt;/li&gt;
        &lt;li&gt;Missing critical derivation steps&lt;/li&gt;
        &lt;li&gt;Overlooking special cases and exceptions&lt;/li&gt;
    &lt;/ul&gt;

    &lt;h2&gt;Real-World Impact&lt;/h2&gt;
    &lt;p&gt;
        Well-documented comments can:
    &lt;/p&gt;
    &lt;ul&gt;
        &lt;li&gt;Reduce regulatory review cycles&lt;/li&gt;
        &lt;li&gt;Minimize data interpretation questions&lt;/li&gt;
        &lt;li&gt;Improve study reproducibility&lt;/li&gt;
        &lt;li&gt;Facilitate knowledge transfer between team members&lt;/li&gt;
    &lt;/ul&gt;

    &lt;h2&gt;Looking Ahead&lt;/h2&gt;
    &lt;p&gt;
        As clinical trials become more complex and regulatory requirements evolve, the importance of clear, 
        comprehensive documentation in define.xml will only increase. Mastering the art of writing effective 
        comments is no longer optional - it&#39;s a crucial skill for modern clinical data managers.
    &lt;/p&gt;

    &lt;div class=&quot;highlight&quot;&gt;
        &lt;p&gt;&lt;strong&gt;Pro Tip:&lt;/strong&gt; Regular review and updates of your define.xml comments can save countless 
        hours during the submission process and prevent last-minute documentation crises.&lt;/p&gt;
    &lt;/div&gt;

    &lt;h2&gt;Conclusion&lt;/h2&gt;
    &lt;p&gt;
        The Comments tab in define.xml is your opportunity to provide clarity, context, and completeness to your 
        clinical data documentation. By following these guidelines and best practices, you can create more robust, 
        reviewable, and valuable documentation that stands up to regulatory scrutiny and serves as a valuable 
        resource for your entire study team.
    &lt;/p&gt;

    &lt;div style=&quot;margin-top: 40px; border-top: 1px solid #ccc; padding-top: 20px;&quot;&gt;
        &lt;p&gt;&lt;em&gt;Share your thoughts and experiences with define.xml documentation in the comments below!&lt;/em&gt;&lt;/p&gt;
    &lt;/div&gt;
&lt;/body&gt;
&lt;/html&gt;


</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2315822260943695633/posts/default/750762088325638314'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2315822260943695633/posts/default/750762088325638314'/><link rel='alternate' type='text/html' href='http://studysas.blogspot.com/2025/01/mastering-art-of-comments-in-definexml.html' title='Mastering the Art of Comments in Define.xml: Your Ultimate Guide to Clinical Data Documentation'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-2315822260943695633.post-5318313199840214809</id><published>2025-01-14T08:00:00.003-05:00</published><updated>2025-01-20T09:03:50.198-05:00</updated><title type='text'>Understanding SDTM EX and EC Domain Annotations</title><content type='html'>&lt;!DOCTYPE html&gt;
&lt;html lang=&quot;en&quot;&gt;
&lt;head&gt;
    &lt;meta charset=&quot;UTF-8&quot;&gt;
    &lt;meta name=&quot;viewport&quot; content=&quot;width=device-width, initial-scale=1.0&quot;&gt;
    &lt;title&gt;Understanding SDTM EX and EC Domain Annotations&lt;/title&gt;
    &lt;style&gt;
        body {
            font-family: Arial, sans-serif;
            line-height: 1.6;
            margin: 20px;
            background-color: #f9f9f9;
        }
        header {
            text-align: center;
            padding-bottom: 20px;
        }
        h1 {
            color: #2c3e50;
        }
        h2 {
            color: #34495e;
            border-bottom: 2px solid #ecf0f1;
            padding-bottom: 5px;
        }
        h3 {
            color: #16a085;
        }
        p {
            color: #2d3436;
        }
        ul {
            margin: 10px 0 10px 20px;
        }
        .example {
            background-color: #ecf0f1;
            padding: 15px;
            border-left: 4px solid #3498db;
            margin: 20px 0;
        }
        .conclusion {
            background-color: #dff9fb;
            padding: 15px;
            border-left: 4px solid #1abc9c;
            margin: 20px 0;
        }
        footer {
            text-align: center;
            margin-top: 40px;
            color: #95a5a6;
            font-size: 0.9em;
        }
    &lt;/style&gt;
&lt;/head&gt;
&lt;body&gt;

    &lt;header&gt;
        &lt;h1&gt;Understanding SDTM EX and EC Domain Annotations&lt;/h1&gt;
        &lt;p&gt;By StudySAS Team | January 7, 2025&lt;/p&gt;
    &lt;/header&gt;

    &lt;section&gt;
        &lt;h2&gt;Introduction to SDTM Domains&lt;/h2&gt;
        &lt;p&gt;
            In clinical data management, the &lt;strong&gt;Study Data Tabulation Model (SDTM)&lt;/strong&gt; is a crucial standard for organizing and formatting data to streamline regulatory submissions. Among its various domains, the &lt;strong&gt;EX (Exposure)&lt;/strong&gt; and &lt;strong&gt;EC (Exposure as Collected)&lt;/strong&gt; domains play significant roles in documenting participant exposures and related events during a study.
        &lt;/p&gt;
        &lt;p&gt;
            This blog post delves into when and how to annotate these domains, providing detailed examples and best practices based on the &lt;em&gt;SDTM Implementation Guide (IG) Version 3.3&lt;/em&gt;.
        &lt;/p&gt;
    &lt;/section&gt;

    &lt;section&gt;
        &lt;h2&gt;Understanding the EX Domain&lt;/h2&gt;
        &lt;p&gt;
            The &lt;strong&gt;EX (Exposure)&lt;/strong&gt; domain is essential for capturing detailed information about the administration of investigational products, concomitant medications, or other exposures participants receive during a clinical study.
        &lt;/p&gt;
        &lt;h3&gt;When to Annotate the EX Domain&lt;/h3&gt;
        &lt;ul&gt;
            &lt;li&gt;&lt;strong&gt;Required:&lt;/strong&gt; If your study involves any form of product or treatment administration.&lt;/li&gt;
            &lt;li&gt;Commonly used in most clinical trials to document dosage, route, frequency, and duration of treatments.&lt;/li&gt;
            &lt;li&gt;Includes data elements such as dosage, route of administration, frequency, duration, and start/end dates of treatment.&lt;/li&gt;
        &lt;/ul&gt;

        &lt;h3&gt;Examples When EX Domain Annotations are Required&lt;/h3&gt;
        &lt;div class=&quot;example&quot;&gt;
            &lt;h4&gt;1. Clinical Trials Involving Investigational Products&lt;/h4&gt;
            &lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; A Phase III clinical trial testing the efficacy and safety of a new oncology drug.&lt;/p&gt;
            &lt;ul&gt;
                &lt;li&gt;&lt;strong&gt;Data Elements:&lt;/strong&gt; EXDOSE, EXROUTE, EXSTDTC, EXENDTC&lt;/li&gt;
            &lt;/ul&gt;
        &lt;/div&gt;

        &lt;div class=&quot;example&quot;&gt;
            &lt;h4&gt;2. Studies Involving Concomitant Medications&lt;/h4&gt;
            &lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; A clinical trial assessing a new antihypertensive drug where participants continue existing blood pressure medications.&lt;/p&gt;
            &lt;ul&gt;
                &lt;li&gt;&lt;strong&gt;Data Elements:&lt;/strong&gt; EXTRT, EXDOSE, EXSTDTC, EXENDTC&lt;/li&gt;
            &lt;/ul&gt;
        &lt;/div&gt;

        &lt;!-- Additional examples can be added similarly --&gt;

        &lt;h3&gt;Key Considerations for EX Domain Annotation&lt;/h3&gt;
        &lt;ul&gt;
            &lt;li&gt;&lt;strong&gt;Comprehensive Data Capture:&lt;/strong&gt; Document all relevant aspects of exposure.&lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;Study Protocol Alignment:&lt;/strong&gt; Ensure annotations align with study protocols.&lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;Regulatory Compliance:&lt;/strong&gt; Adhere to SDTM IG 3.3 and regulatory guidelines.&lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;Data Quality and Integrity:&lt;/strong&gt; Maintain consistency and accuracy in data mapping.&lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;Collaboration:&lt;/strong&gt; Work with clinical operations and data management teams.&lt;/li&gt;
        &lt;/ul&gt;
    &lt;/section&gt;

    &lt;section&gt;
        &lt;h2&gt;Understanding the EC Domain&lt;/h2&gt;
        &lt;p&gt;
            The &lt;strong&gt;EC (Exposure as Collected)&lt;/strong&gt; domain is designed to capture specific events or deviations related to participant exposures. These can include missed doses, dose modifications, discontinuations, or any deviations from the prescribed exposure protocol.
        &lt;/p&gt;

        &lt;h3&gt;When is EC Domain Annotation Required?&lt;/h3&gt;
        &lt;ul&gt;
            &lt;li&gt;&lt;strong&gt;Mandatory:&lt;/strong&gt;
                &lt;ul&gt;
                    &lt;li&gt;Deviation from exposure protocol, such as missed doses or dose modifications.&lt;/li&gt;
                    &lt;li&gt;Dose adjustments based on participant tolerance, efficacy, or safety concerns.&lt;/li&gt;
                    &lt;li&gt;Temporary interruptions due to adverse reactions or concurrent medical conditions.&lt;/li&gt;
                &lt;/ul&gt;
            &lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;Study Design:&lt;/strong&gt;
                &lt;ul&gt;
                    &lt;li&gt;Studies with detailed monitoring of exposure compliance.&lt;/li&gt;
                    &lt;li&gt;Trials involving complex exposure regimens like combination therapies.&lt;/li&gt;
                    &lt;li&gt;Studies where exposure deviations significantly impact participant safety or study integrity.&lt;/li&gt;
                &lt;/ul&gt;
            &lt;/li&gt;
        &lt;/ul&gt;

        &lt;h3&gt;When is EC Domain Annotation Beneficial?&lt;/h3&gt;
        &lt;ul&gt;
            &lt;li&gt;&lt;strong&gt;Enhanced Data Transparency:&lt;/strong&gt; Provides a complete picture of participant exposure and deviations.&lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;Improved Regulatory Compliance:&lt;/strong&gt; Facilitates smoother regulatory audits and reviews.&lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;Better Risk Management:&lt;/strong&gt; Helps in early detection of safety signals related to exposure deviations.&lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;Data Integration:&lt;/strong&gt; Enhances interoperability by linking with other SDTM domains.&lt;/li&gt;
        &lt;/ul&gt;

        &lt;h3&gt;Examples and Scenarios for EC Domain Annotation&lt;/h3&gt;
        &lt;div class=&quot;example&quot;&gt;
            &lt;h4&gt;1. Clinical Trials with Dose Escalation Protocols&lt;/h4&gt;
            &lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; A Phase I oncology trial determining the maximum tolerated dose (MTD) of a new chemotherapy agent.&lt;/p&gt;
            &lt;ul&gt;
                &lt;li&gt;&lt;strong&gt;Data Elements:&lt;/strong&gt; ECMOOD, ECRPT&lt;/li&gt;
            &lt;/ul&gt;
        &lt;/div&gt;

        &lt;div class=&quot;example&quot;&gt;
            &lt;h4&gt;2. Studies Monitoring Treatment Adherence&lt;/h4&gt;
            &lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; A study assessing the efficacy of a daily oral antihypertensive medication.&lt;/p&gt;
            &lt;ul&gt;
                &lt;li&gt;&lt;strong&gt;Data Elements:&lt;/strong&gt; ECSEQ, ECLNKID&lt;/li&gt;
            &lt;/ul&gt;
        &lt;/div&gt;

        &lt;!-- Additional examples can be added similarly --&gt;

        &lt;h3&gt;Best Practices for EC Domain Annotation&lt;/h3&gt;
        &lt;ul&gt;
            &lt;li&gt;&lt;strong&gt;Comprehensive Mapping:&lt;/strong&gt; Identify and map all potential exposure events from CRFs to EC domain.&lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;Clear Documentation:&lt;/strong&gt; Document the rationale for each exposure event annotation.&lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;Collaboration:&lt;/strong&gt; Work with cross-functional teams to ensure accurate annotation.&lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;Standardized Coding:&lt;/strong&gt; Use controlled terminologies, such as MedDRA, to code the reasons for exposure events and deviations.&lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;Utilize Automation:&lt;/strong&gt; Employ tools for automated mapping and validation to reduce errors.&lt;/li&gt;
        &lt;/ul&gt;
    &lt;/section&gt;

    &lt;section&gt;
        &lt;h2&gt;When to Use EC Domain Without EX Domain&lt;/h2&gt;
        &lt;p&gt;
            While the &lt;strong&gt;EC (Exposure Events)&lt;/strong&gt; domain is typically used in conjunction with the &lt;strong&gt;EX (Exposure)&lt;/strong&gt; domain, there are specific scenarios where EC annotations might be included without corresponding EX annotations on the &lt;strong&gt;annotated Case Report Form (aCRF)&lt;/strong&gt;.
        &lt;/p&gt;

        &lt;h3&gt;Scenarios for EC Domain Without EX Domain&lt;/h3&gt;
        &lt;div class=&quot;example&quot;&gt;
            &lt;h4&gt;1. Studies Utilizing External or Predefined Exposure Data&lt;/h4&gt;
            &lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; A retrospective observational study assessing exposure to environmental pollutants where exposure data is sourced from external databases.&lt;/p&gt;
            &lt;ul&gt;
                &lt;li&gt;&lt;strong&gt;Why EC Only:&lt;/strong&gt; Exposures are predefined and documented externally; the study focuses on tracking related events.&lt;/li&gt;
                &lt;li&gt;&lt;strong&gt;Data Elements:&lt;/strong&gt; ECSEQ, ECSTDY, ECRPT, ECMOOD&lt;/li&gt;
            &lt;/ul&gt;
        &lt;/div&gt;

        &lt;div class=&quot;example&quot;&gt;
            &lt;h4&gt;2. Protocol Deviations Without Specific Exposure Data&lt;/h4&gt;
            &lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; A clinical trial where site closures or supply interruptions affect treatment administration but do not alter exposure details.&lt;/p&gt;
            &lt;ul&gt;
                &lt;li&gt;&lt;strong&gt;Why EC Only:&lt;/strong&gt; Focus on documenting protocol deviations impacting exposure without changing the exposure data.&lt;/li&gt;
                &lt;li&gt;&lt;strong&gt;Data Elements:&lt;/strong&gt; ECSEQ, ECSTDY, ECRPT, ECMOOD&lt;/li&gt;
            &lt;/ul&gt;
        &lt;/div&gt;

        &lt;div class=&quot;example&quot;&gt;
            &lt;h4&gt;3. High-Level Exposure Event Tracking in Specific Study Designs&lt;/h4&gt;
            &lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; A public health study monitoring major exposure-related incidents like widespread environmental changes.&lt;/p&gt;
            &lt;ul&gt;
                &lt;li&gt;&lt;strong&gt;Why EC Only:&lt;/strong&gt; Records high-level exposure events without individual exposure metrics.&lt;/li&gt;
                &lt;li&gt;&lt;strong&gt;Data Elements:&lt;/strong&gt; ECSEQ, ECSTDY, ECRPT, ECMOOD&lt;/li&gt;
            &lt;/ul&gt;
        &lt;/div&gt;

        &lt;!-- Additional scenarios can be added similarly --&gt;

        &lt;h3&gt;Key Considerations and Best Practices&lt;/h3&gt;
        &lt;ul&gt;
            &lt;li&gt;&lt;strong&gt;Regulatory Compliance:&lt;/strong&gt; Ensure alignment with SDTM IG 3.3 and consult regulatory authorities if deviating from standard practices.&lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;Clear Documentation:&lt;/strong&gt; Justify the exclusion of EX annotations and provide detailed metadata.&lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;Data Integrity:&lt;/strong&gt; Maintain consistent and meaningful EC records even without EX annotations.&lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;Collaboration:&lt;/strong&gt; Engage with all relevant teams to ensure a shared understanding and accurate implementation.&lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;Review and Validation:&lt;/strong&gt; Implement robust validation processes to ensure accuracy and compliance.&lt;/li&gt;
        &lt;/ul&gt;
    &lt;/section&gt;

    &lt;section&gt;
        &lt;h2&gt;Conclusion&lt;/h2&gt;
        &lt;div class=&quot;conclusion&quot;&gt;
            &lt;p&gt;
                Annotating the &lt;strong&gt;EX (Exposure)&lt;/strong&gt; and &lt;strong&gt;EC (Exposure Events)&lt;/strong&gt; domains is pivotal for comprehensive and compliant clinical data management. While the EX domain is generally essential for documenting participant exposures, the EC domain provides additional depth by capturing related events and deviations. Understanding when and how to use these domains, especially the specific cases where EC can be used without EX, ensures data integrity, regulatory compliance, and facilitates robust data analysis.
            &lt;/p&gt;
            &lt;p&gt;
                For more detailed guidelines, always refer to the &lt;em&gt;CDISC SDTM Implementation Guide, Version 3.3&lt;/em&gt; and consult with regulatory bodies as needed.
            &lt;/p&gt;
        &lt;/div&gt;
    &lt;/section&gt;

    &lt;footer&gt;
        &lt;p&gt;&amp;copy; 2025 StudySAS. All rights reserved.&lt;/p&gt;
    &lt;/footer&gt;

&lt;/body&gt;
&lt;/html&gt;
</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2315822260943695633/posts/default/5318313199840214809'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2315822260943695633/posts/default/5318313199840214809'/><link rel='alternate' type='text/html' href='http://studysas.blogspot.com/2025/01/understanding-sdtm-ex-and-ec-domain.html' title='Understanding SDTM EX and EC Domain Annotations'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-2315822260943695633.post-5033041057240281212</id><published>2025-01-10T14:29:00.003-05:00</published><updated>2025-01-10T14:40:09.192-05:00</updated><title type='text'>Protocol Version Mapping in SDTM Disposition Events: A Comprehensive Guide</title><content type='html'>&lt;!DOCTYPE html&gt;
&lt;html&gt;
&lt;head&gt;
    &lt;style&gt;
        body {
            font-family: system-ui, -apple-system, sans-serif;
            line-height: 1.6;
            max-width: 1300px;
            margin: 0 auto;
            padding: 20px;
            color: #333;
        }
        h1, h2, h3 {
            color: #2c5282;
            margin-top: 1.5em;
        }
        .alert {
            background-color: #f8f9fa;
            border-left: 4px solid #2c5282;
            padding: 1em;
            margin: 1em 0;
        }
        code {
            background-color: #f5f5f5;
            padding: 0.2em 0.4em;
            border-radius: 3px;
            font-family: monospace;
        }
        .table-container {
            overflow-x: auto;
            margin: 1.5em 0;
        }
        table {
            border-collapse: collapse;
            width: 100%;
        }
        th, td {
            border: 1px solid #ddd;
            padding: 8px;
            text-align: left;
        }
        th {
            background-color: #f8f9fa;
        }
        .example {
            background-color: #f8f9fa;
            padding: 1em;
            margin: 1em 0;
            border-radius: 4px;
        }
    &lt;/style&gt;
&lt;/head&gt;
&lt;body&gt;
    &lt;h1&gt;Protocol Version Mapping in SDTM Disposition Events: A Comprehensive Guide&lt;/h1&gt;

    &lt;div class=&quot;alert&quot;&gt;
        &lt;strong&gt;Key Concept:&lt;/strong&gt; The decision to map protocol version information in SDTM Disposition (DS) domain requires careful analysis of its relationship to disposition events and understanding of data management requirements.
    &lt;/div&gt;

    &lt;h2&gt;Understanding Protocol Version&#39;s Role in Disposition Events&lt;/h2&gt;
    &lt;p&gt;Protocol versions can significantly impact disposition events in clinical trials. Their relationship to these events determines the appropriate mapping strategy within the SDTM structure. This relationship can be categorized into two main types:&lt;/p&gt;

    &lt;div class=&quot;example&quot;&gt;
        &lt;h3&gt;Direct Impact Relationship&lt;/h3&gt;
        &lt;p&gt;When protocol version changes directly cause or influence disposition events, such as:&lt;/p&gt;
        &lt;ul&gt;
            &lt;li&gt;Subject withdrawal due to protocol amendment modifications&lt;/li&gt;
            &lt;li&gt;Study discontinuation resulting from significant protocol changes&lt;/li&gt;
            &lt;li&gt;Protocol-mandated subject transfers between treatment arms&lt;/li&gt;
        &lt;/ul&gt;

        &lt;h3&gt;Contextual Relationship&lt;/h3&gt;
        &lt;p&gt;When protocol version provides important context but doesn&#39;t directly cause the disposition event:&lt;/p&gt;
        &lt;ul&gt;
            &lt;li&gt;Standard discontinuations occurring under different protocol versions&lt;/li&gt;
            &lt;li&gt;Administrative changes documented in protocol updates&lt;/li&gt;
            &lt;li&gt;Background information for data analysis&lt;/li&gt;
        &lt;/ul&gt;
    &lt;/div&gt;

    &lt;h2&gt;Decision Framework for Protocol Version Mapping&lt;/h2&gt;
    &lt;p&gt;To determine the appropriate mapping strategy, follow this structured evaluation process:&lt;/p&gt;

    &lt;h3&gt;1. Impact Analysis&lt;/h3&gt;
    &lt;p&gt;Evaluate the protocol version&#39;s influence on disposition events by examining:&lt;/p&gt;
    &lt;ul&gt;
        &lt;li&gt;Direct causality between protocol changes and subject disposition&lt;/li&gt;
        &lt;li&gt;Regulatory requirements for tracking protocol version information&lt;/li&gt;
        &lt;li&gt;Statistical analysis requirements for protocol version stratification&lt;/li&gt;
    &lt;/ul&gt;

    &lt;h3&gt;2. Data Structure Assessment&lt;/h3&gt;
    &lt;p&gt;Consider the following aspects of your data structure:&lt;/p&gt;
    &lt;ul&gt;
        &lt;li&gt;Existing variable relationships in the DS domain&lt;/li&gt;
        &lt;li&gt;Need for protocol version traceability&lt;/li&gt;
        &lt;li&gt;Impact on data analysis and reporting requirements&lt;/li&gt;
    &lt;/ul&gt;

    &lt;h2&gt;Protocol Version Mapping Methods: Decision Guide&lt;/h2&gt;

&lt;div class=&quot;alert&quot;&gt;
    &lt;strong&gt;Key Principle:&lt;/strong&gt; The choice of mapping method should be driven by the relationship between the protocol version and the disposition event, regulatory requirements, and analysis needs. Each method serves a specific purpose and has distinct advantages.
&lt;/div&gt;

&lt;h3&gt;Method 1: DSREFID Mapping&lt;/h3&gt;
&lt;p&gt;Use DSREFID mapping when protocol version information is fundamental to understanding or explaining the disposition event. This method is most appropriate in the following scenarios:&lt;/p&gt;

&lt;div class=&quot;example&quot;&gt;
    &lt;strong&gt;When to Use DSREFID:&lt;/strong&gt;
    &lt;ul&gt;
        &lt;li&gt;The disposition event occurs as a direct result of a protocol amendment or version change. For example, when a subject withdraws because new procedures were introduced in a protocol amendment.&lt;/li&gt;
        &lt;li&gt;Protocol version changes trigger mandatory subject discontinuation or transfer between treatment arms.&lt;/li&gt;
        &lt;li&gt;Regulatory requirements specifically mandate tracking the relationship between protocol versions and disposition events.&lt;/li&gt;
        &lt;li&gt;The protocol version is essential for understanding the context of the disposition decision and will be needed for primary analysis.&lt;/li&gt;
    &lt;/ul&gt;
&lt;/div&gt;

&lt;h3&gt;Method 2: Supplemental Qualifiers (SUPPDS)&lt;/h3&gt;
&lt;p&gt;Use supplemental qualifiers when protocol version information provides important context but isn&#39;t directly causal to the disposition event. This approach is best suited for:&lt;/p&gt;

&lt;div class=&quot;example&quot;&gt;
    &lt;strong&gt;When to Use SUPPDS:&lt;/strong&gt;
    &lt;ul&gt;
        &lt;li&gt;Protocol version information needs to be preserved for traceability but isn&#39;t directly related to the disposition decision.&lt;/li&gt;
        &lt;li&gt;The information might be needed for secondary analyses or regulatory documentation.&lt;/li&gt;
        &lt;li&gt;You need to maintain protocol version history without implying direct causality with disposition events.&lt;/li&gt;
        &lt;li&gt;The protocol version is part of standard documentation requirements but doesn&#39;t affect the interpretation of the disposition event.&lt;/li&gt;
    &lt;/ul&gt;
&lt;/div&gt;

&lt;h3&gt;Method 3: Custom Variables&lt;/h3&gt;
&lt;p&gt;Consider creating custom variables (with sponsor and regulatory approval) in these situations:&lt;/p&gt;

&lt;div class=&quot;example&quot;&gt;
    &lt;strong&gt;When to Use Custom Variables:&lt;/strong&gt;
    &lt;ul&gt;
        &lt;li&gt;Your study has unique protocol version tracking requirements that don&#39;t fit well in standard variables.&lt;/li&gt;
        &lt;li&gt;You need to capture multiple aspects of protocol versioning (e.g., both amendment number and version date).&lt;/li&gt;
        &lt;li&gt;Sponsor-specific requirements necessitate specialized protocol version tracking.&lt;/li&gt;
        &lt;li&gt;The relationship between protocol versions and disposition events needs to be analyzed in ways not supported by standard variables.&lt;/li&gt;
    &lt;/ul&gt;
&lt;/div&gt;

&lt;h3&gt;Method 4: Comments or DSTERM&lt;/h3&gt;
&lt;p&gt;Include protocol version information in comments or DSTERM when:&lt;/p&gt;

&lt;div class=&quot;example&quot;&gt;
    &lt;strong&gt;When to Use Comments/DSTERM:&lt;/strong&gt;
    &lt;ul&gt;
        &lt;li&gt;The protocol version provides helpful background information but isn&#39;t required for analysis.&lt;/li&gt;
        &lt;li&gt;You need to provide additional context about protocol version impacts without formal tracking.&lt;/li&gt;
        &lt;li&gt;The information is purely descriptive and won&#39;t be used in analyses.&lt;/li&gt;
        &lt;li&gt;You need to capture protocol version details in a narrative format.&lt;/li&gt;
    &lt;/ul&gt;
&lt;/div&gt;

&lt;h3&gt;Decision Support Matrix&lt;/h3&gt;
&lt;div class=&quot;table-container&quot;&gt;
    &lt;table&gt;
        &lt;thead&gt;
            &lt;tr&gt;
                &lt;th&gt;Characteristic&lt;/th&gt;
                &lt;th&gt;DSREFID&lt;/th&gt;
                &lt;th&gt;SUPPDS&lt;/th&gt;
                &lt;th&gt;Custom Variable&lt;/th&gt;
                &lt;th&gt;Comments/DSTERM&lt;/th&gt;
            &lt;/tr&gt;
        &lt;/thead&gt;
        &lt;tbody&gt;
            &lt;tr&gt;
                &lt;td&gt;Direct Causality&lt;/td&gt;
                &lt;td&gt;Required&lt;/td&gt;
                &lt;td&gt;Optional&lt;/td&gt;
                &lt;td&gt;Variable&lt;/td&gt;
                &lt;td&gt;No&lt;/td&gt;
            &lt;/tr&gt;
            &lt;tr&gt;
                &lt;td&gt;Analysis Impact&lt;/td&gt;
                &lt;td&gt;Primary&lt;/td&gt;
                &lt;td&gt;Secondary&lt;/td&gt;
                &lt;td&gt;Study-specific&lt;/td&gt;
                &lt;td&gt;None&lt;/td&gt;
            &lt;/tr&gt;
            &lt;tr&gt;
                &lt;td&gt;Traceability&lt;/td&gt;
                &lt;td&gt;High&lt;/td&gt;
                &lt;td&gt;Medium&lt;/td&gt;
                &lt;td&gt;High&lt;/td&gt;
                &lt;td&gt;Low&lt;/td&gt;
            &lt;/tr&gt;
            &lt;tr&gt;
                &lt;td&gt;Implementation Complexity&lt;/td&gt;
                &lt;td&gt;Medium&lt;/td&gt;
                &lt;td&gt;Low&lt;/td&gt;
                &lt;td&gt;High&lt;/td&gt;
                &lt;td&gt;Low&lt;/td&gt;
            &lt;/tr&gt;
        &lt;/tbody&gt;
    &lt;/table&gt;
&lt;/div&gt;

&lt;h2&gt;Mapping Implementation Strategies&lt;/h2&gt;

    &lt;h3&gt;Strategy 1: DSREFID Mapping&lt;/h3&gt;
    &lt;div class=&quot;table-container&quot;&gt;
        &lt;table&gt;
            &lt;thead&gt;
                &lt;tr&gt;
                    &lt;th&gt;USUBJID&lt;/th&gt;
                    &lt;th&gt;DSSEQ&lt;/th&gt;
                    &lt;th&gt;DSTERM&lt;/th&gt;
                    &lt;th&gt;DSDECOD&lt;/th&gt;
                    &lt;th&gt;DSREFID&lt;/th&gt;
                    &lt;th&gt;DSSTDTC&lt;/th&gt;
                &lt;/tr&gt;
            &lt;/thead&gt;
            &lt;tbody&gt;
                &lt;tr&gt;
                    &lt;td&gt;123-45678&lt;/td&gt;
                    &lt;td&gt;1&lt;/td&gt;
                    &lt;td&gt;Protocol Amendment Withdrawal&lt;/td&gt;
                    &lt;td&gt;WITHDRAWAL&lt;/td&gt;
                    &lt;td&gt;PROT-V2.0-AMD3&lt;/td&gt;
                    &lt;td&gt;2024-01-15&lt;/td&gt;
                &lt;/tr&gt;
            &lt;/tbody&gt;
        &lt;/table&gt;
    &lt;/div&gt;

    &lt;h3&gt;Strategy 2: Supplemental Qualifiers&lt;/h3&gt;
    &lt;div class=&quot;table-container&quot;&gt;
        &lt;table&gt;
            &lt;thead&gt;
                &lt;tr&gt;
                    &lt;th&gt;STUDYID&lt;/th&gt;
                    &lt;th&gt;RDOMAIN&lt;/th&gt;
                    &lt;th&gt;USUBJID&lt;/th&gt;
                    &lt;th&gt;IDVAR&lt;/th&gt;
                    &lt;th&gt;IDVARVAL&lt;/th&gt;
                    &lt;th&gt;QNAM&lt;/th&gt;
                    &lt;th&gt;QVAL&lt;/th&gt;
                &lt;/tr&gt;
            &lt;/thead&gt;
            &lt;tbody&gt;
                &lt;tr&gt;
                    &lt;td&gt;STUDY001&lt;/td&gt;
                    &lt;td&gt;DS&lt;/td&gt;
                    &lt;td&gt;123-45678&lt;/td&gt;
                    &lt;td&gt;DSSEQ&lt;/td&gt;
                    &lt;td&gt;1&lt;/td&gt;
                    &lt;td&gt;PROTVER&lt;/td&gt;
                    &lt;td&gt;2.0&lt;/td&gt;
                &lt;/tr&gt;
            &lt;/tbody&gt;
        &lt;/table&gt;
    &lt;/div&gt;

    &lt;h2&gt;Best Practices and Recommendations&lt;/h2&gt;
    &lt;div class=&quot;alert&quot;&gt;
        &lt;h3&gt;Documentation Requirements&lt;/h3&gt;
        &lt;ul&gt;
            &lt;li&gt;Clearly document the mapping rationale in the Study Data Reviewer&#39;s Guide&lt;/li&gt;
            &lt;li&gt;Maintain consistent protocol version formatting across all domains&lt;/li&gt;
            &lt;li&gt;Include protocol version mapping decisions in data management plans&lt;/li&gt;
        &lt;/ul&gt;
    &lt;/div&gt;

    &lt;h2&gt;Common Pitfalls to Avoid&lt;/h2&gt;
    &lt;ul&gt;
        &lt;li&gt;Inconsistent protocol version formatting across different domains&lt;/li&gt;
        &lt;li&gt;Overloading DSREFID with non-essential protocol information&lt;/li&gt;
        &lt;li&gt;Failing to document the mapping rationale adequately&lt;/li&gt;
        &lt;li&gt;Inconsistent handling of protocol versions across different studies&lt;/li&gt;
    &lt;/ul&gt;

    &lt;h2&gt;Conclusion&lt;/h2&gt;
    &lt;p&gt;Successful protocol version mapping in SDTM disposition events requires careful consideration of the relationship between protocol versions and disposition events, clear documentation, and consistent implementation. By following these guidelines and best practices, organizations can ensure accurate and compliant data representation while maintaining traceability and supporting effective analysis.&lt;/p&gt;
&lt;/body&gt;
&lt;/html&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2315822260943695633/posts/default/5033041057240281212'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2315822260943695633/posts/default/5033041057240281212'/><link rel='alternate' type='text/html' href='http://studysas.blogspot.com/2025/01/mapping-protocol-version-to-disposition.html' title='Protocol Version Mapping in SDTM Disposition Events: A Comprehensive Guide'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-2315822260943695633.post-2176083623772535595</id><published>2025-01-10T12:56:00.004-05:00</published><updated>2025-01-10T13:02:16.134-05:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="EPOCH Assignment for Pre-Consent Data"/><title type='text'>Understanding EPOCH Assignment in Clinical Trials: The Pre-Consent Data Challenge</title><content type='html'>&lt;!DOCTYPE html&gt;
&lt;html lang=&quot;en&quot;&gt;
&lt;head&gt;
    &lt;meta charset=&quot;UTF-8&quot;&gt;
    &lt;meta name=&quot;viewport&quot; content=&quot;width=device-width, initial-scale=1.0&quot;&gt;
    &lt;title&gt;Understanding EPOCH Assignment in Clinical Trials: The Pre-Consent Data Challenge&lt;/title&gt;
    &lt;style&gt;
        body {
            font-family: Arial, sans-serif;
            line-height: 1.6;
            max-width: 1200px;
            margin: 0 auto;
            padding: 20px;
            color: #333;
        }
        h1 {
            color: #2c3e50;
            border-bottom: 2px solid #eee;
            padding-bottom: 10px;
        }
        h2 {
            color: #34495e;
            margin-top: 30px;
        }
        .highlight {
            background-color: #f8f9fa;
            border-left: 4px solid #007bff;
            padding: 15px;
            margin: 20px 0;
        }
        .note {
            background-color: #fff3cd;
            border: 1px solid #ffeeba;
            padding: 15px;
            border-radius: 4px;
            margin: 20px 0;
        }
        img {
            max-width: 100%;
            height: auto;
        }
    &lt;/style&gt;
&lt;/head&gt;
&lt;body&gt;
    &lt;h1&gt;Understanding EPOCH Assignment in Clinical Trials: The Pre-Consent Data Challenge&lt;/h1&gt;
    
    &lt;p&gt;In the world of clinical trials, data management and standardization play crucial roles in ensuring the quality and integrity of research outcomes. One particularly nuanced aspect of this process is the proper assignment of EPOCH values in SDTM (Study Data Tabulation Model) datasets, especially when dealing with pre-consent data.&lt;/p&gt;

    &lt;h2&gt;What is an EPOCH?&lt;/h2&gt;
    &lt;p&gt;An EPOCH in clinical trials represents a distinct period within the study&#39;s planned design. It helps organize and contextualize various events, interventions, and findings that occur during the study. Common EPOCHs include SCREENING, TREATMENT, FOLLOW-UP, and others as defined by the study protocol.&lt;/p&gt;
    
    &lt;div class=&quot;highlight&quot;&gt;
        &lt;p&gt;&lt;strong&gt;Essential EPOCH Characteristics:&lt;/strong&gt;&lt;/p&gt;
        &lt;ul&gt;
            &lt;li&gt;It is a standardized way to identify different phases of a study&lt;/li&gt;
            &lt;li&gt;Each EPOCH represents a planned element of the study design&lt;/li&gt;
            &lt;li&gt;EPOCHs help establish temporal relationships between different study events&lt;/li&gt;
            &lt;li&gt;They are crucial for data analysis and interpretation of study results&lt;/li&gt;
        &lt;/ul&gt;
    &lt;/div&gt;

    &lt;div class=&quot;highlight&quot;&gt;
        &lt;p&gt;According to SDTM Implementation Guide 3.3 (Section 4.1.3.1), EPOCH is fundamentally a study-design construct. This means it only applies to events that occur after a subject has formally entered the study through the informed consent process.&lt;/p&gt;
    &lt;/div&gt;

    &lt;h2&gt;The Pre-Consent Data Challenge&lt;/h2&gt;
    &lt;p&gt;Clinical researchers often encounter situations where they need to include data that was collected before a subject provided informed consent. This might include:&lt;/p&gt;
    
    &lt;ul&gt;
        &lt;li&gt;Historical medical records&lt;/li&gt;
        &lt;li&gt;Previous laboratory results&lt;/li&gt;
        &lt;li&gt;Prior medications&lt;/li&gt;
        &lt;li&gt;Earlier adverse events&lt;/li&gt;
    &lt;/ul&gt;

    &lt;h2&gt;Why Can&#39;t We Assign SCREENING EPOCH to Pre-Consent Data?&lt;/h2&gt;
    &lt;p&gt;It might seem intuitive to assign pre-consent data to the SCREENING EPOCH, as it&#39;s typically the earliest study phase. However, this would be incorrect for several important reasons:&lt;/p&gt;

    &lt;div class=&quot;note&quot;&gt;
        &lt;p&gt;&lt;strong&gt;Key Points:&lt;/strong&gt;&lt;/p&gt;
        &lt;ol&gt;
            &lt;li&gt;The screening period is a protocol-defined phase that occurs only after informed consent&lt;/li&gt;
            &lt;li&gt;Screening activities are specifically planned and conducted according to the study protocol&lt;/li&gt;
            &lt;li&gt;Pre-consent data collection wasn&#39;t performed under the study&#39;s procedures and requirements&lt;/li&gt;
        &lt;/ol&gt;
    &lt;/div&gt;

    &lt;h2&gt;The Correct Approach: Using Null EPOCH Values&lt;/h2&gt;
    &lt;p&gt;For any data collected before informed consent, the EPOCH variable should be set to null. This approach:&lt;/p&gt;
    
    &lt;ul&gt;
        &lt;li&gt;Accurately represents the timing of events relative to study participation&lt;/li&gt;
        &lt;li&gt;Maintains data integrity and transparency&lt;/li&gt;
        &lt;li&gt;Complies with SDTM standards and guidelines&lt;/li&gt;
        &lt;li&gt;Facilitates proper analysis and interpretation of study data&lt;/li&gt;
    &lt;/ul&gt;

    &lt;h2&gt;Impact on Data Analysis&lt;/h2&gt;
    &lt;p&gt;Understanding the relationship between EPOCH assignment and informed consent dates is crucial for:&lt;/p&gt;
    
    &lt;ul&gt;
        &lt;li&gt;Accurate timeline construction&lt;/li&gt;
        &lt;li&gt;Proper assessment of inclusion/exclusion criteria&lt;/li&gt;
        &lt;li&gt;Correct baseline determinations&lt;/li&gt;
        &lt;li&gt;Valid safety and efficacy analyses&lt;/li&gt;
    &lt;/ul&gt;

    &lt;h2&gt;Best Practices for Implementation&lt;/h2&gt;
    &lt;p&gt;When implementing EPOCH assignments in your clinical trial data:&lt;/p&gt;

    &lt;ol&gt;
        &lt;li&gt;Always document the informed consent date accurately&lt;/li&gt;
        &lt;li&gt;Establish clear procedures for handling pre-consent data&lt;/li&gt;
        &lt;li&gt;Include proper date variables to establish the timing of events&lt;/li&gt;
        &lt;li&gt;Implement quality control checks to ensure correct EPOCH assignments&lt;/li&gt;
        &lt;li&gt;Document your decisions and rationale in the Study Data Reviewer&#39;s Guide&lt;/li&gt;
    &lt;/ol&gt;

    &lt;div class=&quot;note&quot;&gt;
        &lt;p&gt;&lt;strong&gt;Special Considerations for EPOCH Assignment:&lt;/strong&gt;&lt;/p&gt;
        &lt;ul&gt;
            &lt;li&gt;&lt;strong&gt;Protocol Amendments:&lt;/strong&gt; Consider how protocol amendments might affect EPOCH definitions and assignments&lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;Re-screening:&lt;/strong&gt; Handle cases where subjects may need to be re-screened with appropriate EPOCH assignments&lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;Multiple Informed Consents:&lt;/strong&gt; Account for situations where subjects might sign multiple consent forms for different study aspects&lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;Data Integration:&lt;/strong&gt; Ensure consistent EPOCH assignment across all study domains&lt;/li&gt;
        &lt;/ul&gt;
    &lt;/div&gt;

    &lt;h2&gt;Regulatory Compliance and Documentation&lt;/h2&gt;
    &lt;p&gt;Proper EPOCH assignment is crucial for regulatory compliance. When preparing submissions:&lt;/p&gt;
    
    &lt;div class=&quot;highlight&quot;&gt;
        &lt;ul&gt;
            &lt;li&gt;Ensure your approach is clearly documented in the Study Data Reviewer&#39;s Guide (SDRG)&lt;/li&gt;
            &lt;li&gt;Include explanations for any special handling of EPOCH assignments&lt;/li&gt;
            &lt;li&gt;Document any deviations from standard EPOCH assignment practices&lt;/li&gt;
            &lt;li&gt;Maintain traceability between source data and SDTM datasets&lt;/li&gt;
        &lt;/ul&gt;
    &lt;/div&gt;

    &lt;h2&gt;Common Challenges and Solutions&lt;/h2&gt;
    &lt;div class=&quot;note&quot;&gt;
        &lt;p&gt;&lt;strong&gt;Challenge 1: Historical Data Collection&lt;/strong&gt;&lt;br&gt;
        Solution: Clearly identify pre-study data and consistently apply null EPOCH values&lt;/p&gt;
        
        &lt;p&gt;&lt;strong&gt;Challenge 2: Protocol Amendments&lt;/strong&gt;&lt;br&gt;
        Solution: Document how EPOCH assignments are handled when study design changes&lt;/p&gt;
        
        &lt;p&gt;&lt;strong&gt;Challenge 3: Multiple Sub-studies&lt;/strong&gt;&lt;br&gt;
        Solution: Develop clear rules for EPOCH assignment across different study components&lt;/p&gt;
        
        &lt;p&gt;&lt;strong&gt;Challenge 4: Data Integration&lt;/strong&gt;&lt;br&gt;
        Solution: Establish consistent EPOCH assignment rules across all study domains&lt;/p&gt;
    &lt;/div&gt;

    &lt;h2&gt;Conclusion&lt;/h2&gt;
    &lt;p&gt;Proper EPOCH assignment is more than just a technical requirement—it&#39;s fundamental to maintaining the scientific integrity of clinical trial data. By correctly handling pre-consent data with null EPOCH values, we ensure our SDTM datasets accurately reflect the study design and subject participation timeline.&lt;/p&gt;

    &lt;div class=&quot;highlight&quot;&gt;
        &lt;p&gt;&lt;strong&gt;Remember:&lt;/strong&gt; The integrity of clinical trial data depends on accurate representation of when events occurred relative to study participation. Using null EPOCH values for pre-consent data is not just compliant with standards—it&#39;s essential for proper data interpretation and analysis.&lt;/p&gt;
    &lt;/div&gt;

    &lt;p&gt;&lt;em&gt;Last updated: January 2025&lt;/em&gt;&lt;/p&gt;
&lt;/body&gt;
&lt;/html&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2315822260943695633/posts/default/2176083623772535595'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2315822260943695633/posts/default/2176083623772535595'/><link rel='alternate' type='text/html' href='http://studysas.blogspot.com/2025/01/understanding-epoch-assignment-in.html' title='Understanding EPOCH Assignment in Clinical Trials: The Pre-Consent Data Challenge'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-2315822260943695633.post-7237954987737007549</id><published>2025-01-09T15:06:00.007-05:00</published><updated>2025-01-09T15:12:06.176-05:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="sas LOG width of comments is not in between 1 and 200 characters"/><category scheme="http://www.blogger.com/atom/ns#" term="SAS ODS LISTING CLOSE"/><title type='text'>The Critical Role of ODS LISTING Close Statements in SAS: Avoiding Comment Width Errors</title><content type='html'>&lt;!DOCTYPE html&gt;
&lt;html lang=&quot;en&quot;&gt;
&lt;head&gt;
    &lt;meta charset=&quot;UTF-8&quot;&gt;
    &lt;meta name=&quot;viewport&quot; content=&quot;width=device-width, initial-scale=1.0&quot;&gt;
    &lt;meta name=&quot;description&quot; content=&quot;Learn how to use ODS LISTING CLOSE in SAS to avoid common errors like &#39;comment width not between 1 and 200 characters.&#39; Includes examples and troubleshooting tips.&quot;&gt;
    &lt;title&gt;The Critical Role of ODS LISTING Close Statements in SAS: Avoiding Comment Width Errors&lt;/title&gt;
    &lt;style&gt;
        body {
            font-family: Arial, sans-serif;
            line-height: 1.6;
            max-width: 1400px;
            margin: 0 auto;
            padding: 20px;
            color: #333;
        }
        h1, h2 {
            color: #2c3e50;
            border-bottom: 2px solid #eee;
            padding-bottom: 10px;
        }
        .code-block {
            background-color: #f8f9fa;
            padding: 15px;
            border-radius: 5px;
            font-family: monospace;
            white-space: pre-wrap;
            margin: 20px 0;
        }
        .note {
            background-color: #fff3cd;
            border-left: 4px solid #ffc107;
            padding: 15px;
            margin: 20px 0;
        }
    &lt;/style&gt;
&lt;/head&gt;
&lt;body&gt;
    &lt;h1&gt;The Critical Role of ODS LISTING Close Statements in SAS: Avoiding Comment Width Errors&lt;/h1&gt;
    
    &lt;p&gt;One of the common challenges SAS programmers face is encountering the error message: &quot;ERROR: Comment width is not between 1 and 200 characters.&quot; This error, while seemingly straightforward, can be particularly frustrating when it appears unexpectedly in your SAS log. In this article, we&#39;ll dive deep into understanding this error and how proper management of ODS LISTING statements can help you avoid it.&lt;/p&gt;

    &lt;h2&gt;Introduction to ODS&lt;/h2&gt;
    &lt;p&gt;The Output Delivery System (ODS) in SAS enables users to control the appearance and destination of output generated by SAS procedures. ODS allows output to be directed to various destinations like HTML, PDF, RTF, and the default Listing destination. Proper management of these destinations is critical to ensuring clean output and avoiding errors.&lt;/p&gt;

    &lt;h2&gt;Understanding the Error&lt;/h2&gt;
    &lt;p&gt;The error message regarding comment width typically appears when SAS encounters issues with the Output Delivery System (ODS) LISTING destination. While the message suggests a problem with comment width, the root cause often lies in how ODS LISTING is handled in your SAS program.&lt;/p&gt;

    &lt;h2&gt;Why ODS LISTING Matters&lt;/h2&gt;
    &lt;p&gt;The LISTING destination in SAS is the default output destination that creates the traditional SAS output. When not properly closed, it can lead to various issues, including the misleading &quot;comment width&quot; error.&lt;/p&gt;

    &lt;div class=&quot;note&quot;&gt;
        &lt;strong&gt;Important:&lt;/strong&gt; The ODS LISTING destination remains open by default unless explicitly closed. Multiple open instances can cause unexpected behavior and errors.
    &lt;/div&gt;

    &lt;h2&gt;Best Practices for ODS LISTING Management&lt;/h2&gt;
    &lt;p&gt;Here&#39;s how to properly manage ODS LISTING to avoid errors:&lt;/p&gt;

    &lt;div class=&quot;code-block&quot;&gt;
/* Close all ODS destinations at the start */
ods _all_ close;

/* Open specific destinations as needed */
ods listing;

/* Your SAS code here */
proc print data=sashelp.class;
run;

/* Close the listing destination when finished */
ods listing close;
    &lt;/div&gt;

    &lt;h2&gt;Common Scenarios That Trigger the Error&lt;/h2&gt;
    &lt;ul&gt;
        &lt;li&gt;Running multiple procedures without closing ODS LISTING between them&lt;/li&gt;
        &lt;li&gt;Nested ODS LISTING statements without proper closure&lt;/li&gt;
        &lt;li&gt;Batch processing multiple SAS programs where ODS destinations aren&#39;t properly managed&lt;/li&gt;
    &lt;/ul&gt;

    &lt;h2&gt;Preventive Measures&lt;/h2&gt;
    &lt;p&gt;To avoid encountering the comment width error, implement these practices:&lt;/p&gt;

    &lt;div class=&quot;code-block&quot;&gt;
/* Start with a clean slate */
ods _all_ close;

/* Create a macro to manage ODS destinations */
%macro manage_ods;
    ods listing close;
    ods listing;
%mend;

/* Use the macro before critical procedures */
%manage_ods;
proc print data=sashelp.class;
run;

/* Always close at the end */
ods listing close;
    &lt;/div&gt;

    &lt;h2&gt;Troubleshooting Tips&lt;/h2&gt;
    &lt;p&gt;If you encounter the comment width error despite taking precautions:&lt;/p&gt;
    &lt;ul&gt;
        &lt;li&gt;Check your SAS log for any unclosed ODS destinations&lt;/li&gt;
        &lt;li&gt;Verify that your ODS LISTING statements are properly paired (open/close)&lt;/li&gt;
        &lt;li&gt;Consider adding ODS LISTING management statements at key points in your code&lt;/li&gt;
        &lt;li&gt;Use the ODS SHOW statement to view currently open destinations&lt;/li&gt;
    &lt;/ul&gt;

    &lt;div class=&quot;code-block&quot;&gt;
/* Check open ODS destinations */
ods show;
    &lt;/div&gt;

    &lt;h2&gt;Real-World Example&lt;/h2&gt;
    &lt;p&gt;Consider this scenario: You are running multiple SAS procedures in a batch process and encounter the &quot;comment width&quot; error. By including &lt;code&gt;ods _all_ close;&lt;/code&gt; at the start of your program and ensuring proper closure of the LISTING destination with &lt;code&gt;ods listing close;&lt;/code&gt;, the error is resolved. This simple adjustment streamlines the output management and eliminates the issue.&lt;/p&gt;

    &lt;h2&gt;Conclusion&lt;/h2&gt;
    &lt;p&gt;While the &quot;comment width&quot; error message might seem cryptic, understanding its relationship with ODS LISTING management is crucial for SAS programming. By implementing proper ODS LISTING close statements and following best practices for ODS destination management, you can avoid this error and ensure smoother execution of your SAS programs.&lt;/p&gt;

    &lt;p&gt;&lt;strong&gt;Remember:&lt;/strong&gt; Maintaining clean and organized ODS management is the cornerstone of efficient SAS programming. Implement these strategies today to streamline your code, minimize errors, and maximize productivity.&lt;/p&gt;

    &lt;p&gt;Have you encountered the &quot;comment width&quot; error in your SAS programs? Share your solutions or challenges in the comments below!&lt;/p&gt;
&lt;/body&gt;
&lt;/html&gt;
</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2315822260943695633/posts/default/7237954987737007549'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2315822260943695633/posts/default/7237954987737007549'/><link rel='alternate' type='text/html' href='http://studysas.blogspot.com/2025/01/understanding-ods-listing-close-in-sas.html' title='The Critical Role of ODS LISTING Close Statements in SAS: Avoiding Comment Width Errors'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-2315822260943695633.post-4385123653302086708</id><published>2025-01-07T11:26:00.006-05:00</published><updated>2025-01-07T12:17:10.322-05:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="Power Up Your Data Cleaning with the SAS COMPRESS Function"/><title type='text'>Power Up Your Data Cleaning with the SAS COMPRESS Function</title><content type='html'>&lt;!DOCTYPE html&gt;
&lt;html lang=&quot;en&quot;&gt;
&lt;head&gt;
  &lt;meta charset=&quot;UTF-8&quot; /&gt;
  &lt;title&gt;Power Up Your Data Cleaning with the SAS COMPRESS Function&lt;/title&gt;
  &lt;style&gt;
    body {
      font-family: Arial, sans-serif;
      line-height: 1.6;
      margin: 0;
      padding: 0 20px;
      background-color: #f9f9f9;
    }
    header, footer {
      background-color: #3f51b5;
      color: #ffffff;
      padding: 20px 0;
      text-align: center;
    }
    h1, h2, h3 {
      color: #3f51b5;
    }
    .code-block {
      background-color: #f0f0f0;
      padding: 15px;
      margin: 10px 0;
      border-radius: 5px;
      font-family: Consolas, Menlo, monospace;
      white-space: pre; /* Preserve line breaks exactly as typed */
    }
    .highlight {
      background-color: #fef3c7;
      padding: 2px 4px;
      border-radius: 3px;
    }
  &lt;/style&gt;
&lt;/head&gt;
&lt;body&gt;

&lt;header&gt;
  &lt;h1&gt;Power Up Your Data Cleaning with the SAS COMPRESS Function&lt;/h1&gt;
&lt;/header&gt;

&lt;main&gt;
  &lt;article&gt;
    &lt;p&gt;
      When handling large datasets in SAS, it&#39;s common to encounter 
      unwanted characters, extra spaces, or other clutter that can 
      hamper your data analysis. Fortunately, the 
      &lt;strong&gt;COMPRESS&lt;/strong&gt; function helps you clean up your 
      text data efficiently. It can remove, or even keep, specific 
      characters from your strings with minimal effort. Keep reading 
      to learn how you can harness the full potential of the SAS 
      &lt;span class=&quot;highlight&quot;&gt;COMPRESS&lt;/span&gt; function.
    &lt;/p&gt;

    &lt;h2&gt;1. Quick Overview of the COMPRESS Function&lt;/h2&gt;
    &lt;p&gt;
      The &lt;strong&gt;COMPRESS&lt;/strong&gt; function in SAS removes (or 
      optionally keeps) certain characters from a character string. 
      Its basic syntax looks like this:
    &lt;/p&gt;
    &lt;div class=&quot;code-block&quot;&gt;
result_string = COMPRESS(source_string &lt;, characters_to_remove&gt; &lt;, modifiers&gt;);
    &lt;/div&gt;
    &lt;ul&gt;
      &lt;li&gt;&lt;strong&gt;source_string&lt;/strong&gt;: The original string you want to modify.&lt;/li&gt;
      &lt;li&gt;&lt;strong&gt;characters_to_remove&lt;/strong&gt; (optional): A list of specific characters to eliminate.&lt;/li&gt;
      &lt;li&gt;&lt;strong&gt;modifiers&lt;/strong&gt; (optional): Special flags (e.g., remove digits, punctuation, etc.).&lt;/li&gt;
    &lt;/ul&gt;

    &lt;h2&gt;2. Removing Specific Characters&lt;/h2&gt;
    &lt;p&gt;
      Suppose you have a string containing multiple symbols and you only 
      want to remove a specific one, such as the ampersand (&lt;code&gt;&amp;amp;&lt;/code&gt;).
    &lt;/p&gt;
    &lt;div class=&quot;code-block&quot;&gt;
data _null_;
  original = &quot;Cats &amp; Dogs 123&quot;;
  no_andsign = compress(original, &#39;&amp;&#39;);
  put no_andsign=;   /* Result: &quot;Cats  Dogs 123&quot; */
run;
    &lt;/div&gt;
    &lt;p&gt;
      In this example, we explicitly provide &lt;code&gt;&#39;&amp;amp;&#39;&lt;/code&gt; in the second argument, so only ampersands 
      are removed. Spaces, digits, and other characters remain.
    &lt;/p&gt;

    &lt;h2&gt;3. Removing All Spaces by Default&lt;/h2&gt;
    &lt;p&gt;
      If you leave out the second argument entirely, &lt;strong&gt;COMPRESS&lt;/strong&gt; automatically removes &lt;em&gt;all&lt;/em&gt; 
      spaces (including blank spaces). Here&#39;s a simple demonstration:
    &lt;/p&gt;
    &lt;div class=&quot;code-block&quot;&gt;
data _null_;
  original = &quot;Hello World &quot;;
  remove_blanks = compress(original);
  put remove_blanks=;  /* Result: &quot;HelloWorld&quot; */
run;
    &lt;/div&gt;

    &lt;h2&gt;4. Unleashing the Power of Modifiers&lt;/h2&gt;
    &lt;p&gt;
      Modifiers make &lt;strong&gt;COMPRESS&lt;/strong&gt; extremely powerful, as they 
      allow you to target entire categories of characters with minimal code. 
      Here are some of the most commonly used modifiers:
    &lt;/p&gt;
    &lt;table&gt;
      &lt;thead&gt;
        &lt;tr&gt;
          &lt;th&gt;Modifier&lt;/th&gt;
          &lt;th&gt;Action&lt;/th&gt;
        &lt;/tr&gt;
      &lt;/thead&gt;
      &lt;tbody&gt;
        &lt;tr&gt;
          &lt;td&gt;A&lt;/td&gt;
          &lt;td&gt;Removes all letters (alphabetic characters).&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td&gt;D&lt;/td&gt;
          &lt;td&gt;Removes all digits (0-9).&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td&gt;P&lt;/td&gt;
          &lt;td&gt;Removes all punctuation.&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td&gt;S&lt;/td&gt;
          &lt;td&gt;Removes all space characters.&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td&gt;U&lt;/td&gt;
          &lt;td&gt;Removes uppercase letters (A-Z).&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td&gt;L&lt;/td&gt;
          &lt;td&gt;Removes lowercase letters (a-z).&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td&gt;K&lt;/td&gt;
          &lt;td&gt;Keeps only the listed characters, instead of removing them.&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td&gt;i&lt;/td&gt;
          &lt;td&gt;Ignore case when identifying characters to remove.&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
          &lt;td&gt;t&lt;/td&gt;
          &lt;td&gt;Trims trailing blanks before removal.&lt;/td&gt;
        &lt;/tr&gt;
      &lt;/tbody&gt;
    &lt;/table&gt;

    &lt;h3&gt;4.1 Removing Digits&lt;/h3&gt;
    &lt;p&gt;
      For example, if you want to remove all digits from a string:
    &lt;/p&gt;
    &lt;div class=&quot;code-block&quot;&gt;
data _null_;
  original = &quot;Sales in 2023 increased by 15%&quot;;
  remove_digits = compress(original, , &#39;D&#39;);
  put remove_digits=;  /* Result: &quot;Sales in   increased by %&quot; */
run;
    &lt;/div&gt;
    &lt;p&gt;
      Notice that digits &lt;strong&gt;only&lt;/strong&gt; are removed; spaces and other 
      punctuation stay in place.
    &lt;/p&gt;

    &lt;h3&gt;4.2 Removing Punctuation&lt;/h3&gt;
    &lt;p&gt;
      Removing punctuation is equally straightforward:
    &lt;/p&gt;
    &lt;div class=&quot;code-block&quot;&gt;
data _null_;
  original = &quot;Hello, World! 2025.&quot;;
  no_punct = compress(original, , &#39;P&#39;);
  put no_punct=;  /* Result: &quot;Hello World 2025&quot; */
run;
    &lt;/div&gt;

    &lt;h3&gt;4.3 Combining Modifiers&lt;/h3&gt;
    &lt;p&gt;
      You can stack multiple modifiers together. For instance, to remove both 
      digits and punctuation:
    &lt;/p&gt;
    &lt;div class=&quot;code-block&quot;&gt;
data _null_;
  original = &quot;Item #123, Price: $45.67&quot;;
  remove_digits_punct = compress(original, , &#39;DP&#39;);
  put remove_digits_punct=; /* Result: &quot;Item  Price &quot; */
run;
    &lt;/div&gt;

    &lt;h3&gt;4.4 Using the &quot;Keep&quot; Modifier (K)&lt;/h3&gt;
    &lt;p&gt;
      Instead of specifying which characters to remove, you can flip the logic 
      and tell SAS which characters to &lt;em&gt;keep&lt;/em&gt; using &lt;strong&gt;K&lt;/strong&gt;. 
      For example, to keep only digits:
    &lt;/p&gt;
    &lt;div class=&quot;code-block&quot;&gt;
data _null_;
  original = &quot;Item #123, Price: $45.67&quot;;
  keep_digits = compress(original, &#39;0123456789&#39;, &#39;K&#39;);
  put keep_digits=; /* Result: &quot;1234567&quot; */
run;
    &lt;/div&gt;
    &lt;p&gt;
      Alternatively, combine &lt;code&gt;K&lt;/code&gt; with &lt;code&gt;D&lt;/code&gt; to shorten your code:
    &lt;/p&gt;
    &lt;div class=&quot;code-block&quot;&gt;
data _null_;
  original = &quot;Item #123, Price: $45.67&quot;;
  keep_digits = compress(original, , &#39;KD&#39;);
  put keep_digits=; /* Result: &quot;1234567&quot; */
run;
    &lt;/div&gt;

    &lt;h2&gt;5. Practical Scenarios&lt;/h2&gt;
    &lt;ol&gt;
      &lt;li&gt;&lt;strong&gt;Email Cleaning:&lt;/strong&gt; If you need to remove all punctuation (except “@” and “.”) from an email field, 
        you could selectively keep only those symbols, letters, and digits.
      &lt;/li&gt;
      &lt;li&gt;&lt;strong&gt;Financial Data:&lt;/strong&gt; Stripping out currency symbols and punctuation from a price field so you can convert 
        it into numeric form for calculations.
      &lt;/li&gt;
      &lt;li&gt;&lt;strong&gt;Text Mining:&lt;/strong&gt; Removing digits or punctuation from survey responses to focus on words alone.
      &lt;/li&gt;
    &lt;/ol&gt;

    &lt;h2&gt;6. Performance Considerations&lt;/h2&gt;
    &lt;p&gt;
      While &lt;strong&gt;COMPRESS&lt;/strong&gt; is handy, be mindful of its usage on extremely large datasets or within tight loops, 
      as repeated calls can be computationally expensive. It’s still typically faster than manually parsing strings, but 
      always weigh whether you really need to remove these characters or if you can handle them with custom formats or 
      other string functions.
    &lt;/p&gt;

    &lt;h2&gt;7. Putting it All Together&lt;/h2&gt;
    &lt;p&gt;
      Here’s a quick snippet that removes punctuation, digits, and trailing spaces all at once:
    &lt;/p&gt;
    &lt;div class=&quot;code-block&quot;&gt;
data _null_;
  original_str = &quot;  Hello, SAS 2025!   &quot;;
  /* 
     - &#39;P&#39; removes punctuation
     - &#39;D&#39; removes digits
     - &#39;t&#39; trims trailing blanks
  */
  cleaned_str = compress(original_str, , &#39;PDt&#39;);
  put cleaned_str=; 
  /* Step-by-step:
     1) Remove punctuation =&gt; &quot;  Hello SAS 2025   &quot;
     2) Remove digits =&gt; &quot;  Hello SAS    &quot;
     3) Trim trailing =&gt; &quot;  Hello SAS&quot;
  */
run;
    &lt;/div&gt;

    &lt;p&gt;
      Notice how a simple combination of modifiers can accomplish multiple clean-up tasks at once, giving you a much 
      tidier dataset in just one line of code (though, of course, you see it here laid out clearly in multiple lines 
      just like SAS EG would present it).
    &lt;/p&gt;

    &lt;h2&gt;Final Thoughts&lt;/h2&gt;
    &lt;p&gt;
      Whether you&#39;re massaging marketing data, cleaning up survey responses, or 
      extracting numeric values from text-heavy fields, the SAS 
      &lt;strong&gt;COMPRESS&lt;/strong&gt; function has you covered. With its powerful 
      modifiers and flexible syntax, it saves both time and effort, leaving 
      you more space to focus on the analytical heavy lifting. Give it a try 
      in your next data-cleaning project—you might be surprised at how 
      much cleaner your logs (and your data) become!
    &lt;/p&gt;
  &lt;/article&gt;
&lt;/main&gt;

&lt;footer&gt;
&lt;p&gt;&lt;em&gt;Posted by &lt;strong&gt;StudySAS&lt;/strong&gt; on studysas.blogspot.com&lt;/em&gt;&lt;/p&gt;
&lt;/footer&gt;

&lt;/body&gt;
&lt;/html&gt;
</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2315822260943695633/posts/default/4385123653302086708'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2315822260943695633/posts/default/4385123653302086708'/><link rel='alternate' type='text/html' href='http://studysas.blogspot.com/2025/01/power-up-your-data-cleaning-with-sas.html' title='Power Up Your Data Cleaning with the SAS COMPRESS Function'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-2315822260943695633.post-6296013984339958520</id><published>2025-01-07T10:03:00.005-05:00</published><updated>2025-01-07T11:24:12.445-05:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="Non Printable characters in SAS datasets. Remove or Drop non printable characters"/><title type='text'>Solving Non-Printable Characters in AETERM/MHTERM for SDTM Datasets</title><content type='html'>&lt;!DOCTYPE html&gt;
&lt;html lang=&quot;en&quot;&gt;
&lt;head&gt;
    &lt;meta charset=&quot;UTF-8&quot;&gt;
    &lt;title&gt;Solving Non-Printable Characters in AETERM/MHTERM for SDTM Datasets&lt;/title&gt;
    &lt;style&gt;
        body {
            font-family: Arial, sans-serif;
            margin: 20px;
            line-height: 1.6;
        }
        h1, h2, h3 {
            color: #333;
        }
        code {
            background-color: #f8f8f8;
            padding: 2px 4px;
            font-family: &quot;Courier New&quot;, Courier, monospace;
        }
        pre {
            background-color: #f8f8f8;
            padding: 10px;
            overflow-x: auto;
        }
        .highlight {
            background-color: #fffae6;
            padding: 10px;
            border: 1px solid #ffe58f;
            margin: 10px 0;
        }
    &lt;/style&gt;
&lt;/head&gt;
&lt;body&gt;

&lt;h1&gt;Solving Non-Printable Characters in AETERM/MHTERM for SDTM Datasets&lt;/h1&gt;

&lt;p&gt;
    Managing text variables in SDTM domains such as &lt;code&gt;AETERM&lt;/code&gt; (for Adverse Events) or 
    &lt;code&gt;MHTERM&lt;/code&gt; (for Medical History) can be challenging when non-printable (hidden) characters sneak in.
    These characters often arise from external data sources, copy-pasting from emails, encoding mismatches, or raw text 
    that includes ASCII control characters. In this post, we’ll explore methods to detect and remove 
    these problematic characters to ensure your SDTM datasets are submission-ready.
&lt;/p&gt;

&lt;hr /&gt;

&lt;h2&gt;1. Identifying Non-Printable Characters&lt;/h2&gt;
&lt;p&gt;
    Non-printable characters generally fall within the ASCII “control” range:
&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;Hex range: &lt;code&gt;00&lt;/code&gt;–&lt;code&gt;1F&lt;/code&gt; and &lt;code&gt;7F&lt;/code&gt;&lt;/li&gt;
    &lt;li&gt;Decimal range: &lt;code&gt;0&lt;/code&gt;–&lt;code&gt;31&lt;/code&gt; and &lt;code&gt;127&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;
    In SAS, you can detect these characters by examining their ASCII values using &lt;code&gt;RANK()&lt;/code&gt;, or by leveraging 
    built-in functions like &lt;code&gt;ANYCNTRL()&lt;/code&gt;. Below is an example snippet that loops through the first 100 
    observations of &lt;code&gt;AETERM&lt;/code&gt;, logs the position of any non-printable character, and displays its ASCII rank:
&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-sas&quot;&gt;
data check_chars;
   set yourlib.ae (obs=100);

   /* For demonstration, adjust these lengths to fit your actual data. */
   length test_char $1 non_print_char_flag $200;

   do i = 1 to length(aeterm);
      test_char = substr(aeterm, i, 1);

      /* Check for non-printable ASCII control characters (0–31, 127) */
      if rank(test_char) &lt; 32 or rank(test_char) = 127 then do;

         /* Build a single message string */
         non_print_char_flag = catx(&#39; &#39;,
            &#39;Non-printable found in USUBJID=&#39;, usubjid,
            &#39;at position=&#39;, put(i, best.),
            &#39;character=&#39;, test_char,
            &#39;rank=&#39;, put(rank(test_char), best.)
         );

         /* Write the message string to the SAS log */
         put non_print_char_flag;
      end;
   end;
run;

&lt;/code&gt;&lt;/pre&gt;

&lt;hr /&gt;

&lt;h2&gt;2. Removing Non-Printable Characters&lt;/h2&gt;
&lt;p&gt;
    Once you confirm non-printable characters are present, you can remove them in various ways. 
    Below are three common approaches:
&lt;/p&gt;

&lt;h3&gt;A. Using COMPRESS with Character Classes&lt;/h3&gt;
&lt;p&gt;
    The simplest way is to use the &lt;code&gt;COMPRESS&lt;/code&gt; function with the &lt;code&gt;&#39;c&#39;&lt;/code&gt; modifier, which removes 
    control characters (ASCII 0–31, 127):
&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-sas&quot;&gt;
data clean;
  set yourlib.ae;
  /* Option 1: the &#39;c&#39; modifier removes control characters (ASCII 0–31, 127) */
  /*aeterm_clean = compress(aeterm, , &#39;c&#39;); */
  /* Option 2: &#39;kw&#39; keeps only printable (writable) characters, dropping everything else */
   aeterm_clean = compress(aeterm, , &#39;kw&#39;); 
run;
&lt;/code&gt;&lt;/pre&gt;

&lt;h3&gt;B. Using a Perl Regular Expression (PRXCHANGE)&lt;/h3&gt;
&lt;p&gt;
    A more targeted approach uses &lt;code&gt;PRXPARSE&lt;/code&gt; and &lt;code&gt;PRXCHANGE&lt;/code&gt;. For instance, the following 
    regex removes control characters in the ranges &lt;code&gt;00–08&lt;/code&gt;, &lt;code&gt;0B&lt;/code&gt;, &lt;code&gt;0C&lt;/code&gt;, 
    &lt;code&gt;0E–1F&lt;/code&gt;, and &lt;code&gt;7F&lt;/code&gt;:
&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-sas&quot;&gt;
data clean;
  set yourlib.ae;
  /* Remove ASCII 00–08, 0B, 0C, 0E–1F, and 7F */
  if _n_ = 1 then re_removeControls = prxparse(&#39;s/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]+//o&#39;);
  retain re_removeControls;
  aeterm_clean = prxchange(re_removeControls, -1, aeterm);
run;
&lt;/code&gt;&lt;/pre&gt;

&lt;h3&gt;C. Using TRANWRD Iteratively&lt;/h3&gt;
&lt;p&gt;
    For legacy or very narrow use cases, you might remove characters with multiple &lt;code&gt;TRANWRD()&lt;/code&gt; calls.
    However, this approach quickly becomes cumbersome if many different ASCII control characters need to be removed.
&lt;/p&gt;

&lt;hr /&gt;

&lt;h2&gt;3. Incorporating Into SDTM Mapping Programs&lt;/h2&gt;
&lt;p&gt;
    Typically, these solutions are applied during the data transformation from raw data to final SDTM domains. 
    For instance, in creating your &lt;code&gt;AE&lt;/code&gt; domain:
&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sas&quot;&gt;
data sdtm.ae;
  set raw.ae;

  /* Remove non-printable characters from AETERM */
  AETERM = compress(AETERM, , &#39;c&#39;); 

  /* Additional mappings and derivations here */

run;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;
    You can do the same in other domains (e.g., &lt;code&gt;MH&lt;/code&gt;, &lt;code&gt;CM&lt;/code&gt;) for consistent data cleaning.
&lt;/p&gt;

&lt;hr /&gt;

&lt;h2&gt;4. Additional Tips&lt;/h2&gt;
&lt;div class=&quot;highlight&quot;&gt;
    &lt;ul&gt;
        &lt;li&gt;&lt;strong&gt;Strip leading/trailing spaces:&lt;/strong&gt; After removing hidden characters, consider using 
            &lt;code&gt;STRIP()&lt;/code&gt; or &lt;code&gt;LEFT()&lt;/code&gt;/&lt;code&gt;RIGHT()&lt;/code&gt; to ensure no unintended spaces remain.&lt;/li&gt;
        &lt;li&gt;&lt;strong&gt;Compress multiple blanks:&lt;/strong&gt; If control character removal results in extra spaces, 
            &lt;code&gt;COMPBL()&lt;/code&gt; can reduce multiple blanks to a single space.&lt;/li&gt;
        &lt;li&gt;&lt;strong&gt;Document your approach:&lt;/strong&gt; Regulatory bodies often require justification that data cleaning 
            preserves the meaning of reported terms. Keep clear records of any cleaning steps performed.&lt;/li&gt;
        &lt;li&gt;&lt;strong&gt;Use consistently:&lt;/strong&gt; Apply the same cleaning methodology across all relevant 
            domains to avoid inconsistencies.&lt;/li&gt;
    &lt;/ul&gt;
&lt;/div&gt;

&lt;p&gt;
    By following these steps, you’ll ensure cleaner, more compliant SDTM datasets, minimize the risk of 
    downstream submission issues, and maintain higher data quality for your clinical studies.
&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Posted by &lt;strong&gt;StudySAS&lt;/strong&gt; on studysas.blogspot.com&lt;/em&gt;&lt;/p&gt;

&lt;/body&gt;
&lt;/html&gt;
</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2315822260943695633/posts/default/6296013984339958520'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2315822260943695633/posts/default/6296013984339958520'/><link rel='alternate' type='text/html' href='http://studysas.blogspot.com/2025/01/solving-non-printable-characters-in.html' title='Solving Non-Printable Characters in AETERM/MHTERM for SDTM Datasets'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-2315822260943695633.post-4942087196830674850</id><published>2024-12-17T09:34:00.003-05:00</published><updated>2024-12-17T09:34:52.614-05:00</updated><title type='text'>Learn how to view SAS dataset labels without opening the dataset directly in a SAS session. Easy methods and examples included!</title><content type='html'>&lt;!DOCTYPE html&gt;
&lt;html lang=&quot;en&quot;&gt;
&lt;head&gt;
    &lt;meta charset=&quot;UTF-8&quot;&gt;
    &lt;meta name=&quot;viewport&quot; content=&quot;width=device-width, initial-scale=1.0&quot;&gt;
    &lt;meta name=&quot;description&quot; content=&quot;Learn how to view SAS dataset labels without opening the dataset directly in a SAS session. Easy methods and examples included!&quot;&gt;
    &lt;title&gt;Quick Tip: See SAS Dataset Labels Without Opening the Data&lt;/title&gt;
    &lt;style&gt;
        body {
            font-family: Arial, sans-serif;
            line-height: 1.6;
            margin: 20px;
            background-color: #f8f8f8;
            color: #333;
        }
        h1, h2, h3 {
            color: #005792;
        }
        code {
            background-color: #f4f4f4;
            padding: 3px 5px;
            border: 1px solid #ddd;
            display: inline-block;
            border-radius: 4px;
            font-family: &quot;Courier New&quot;, Courier, monospace;
        }
        pre {
            background: #f4f4f4;
            padding: 10px;
            border: 1px solid #ddd;
            overflow: auto;
        }
        .container {
            max-width: 800px;
            margin: auto;
            background: white;
            padding: 20px;
            box-shadow: 0 0 10px rgba(0, 0, 0, 0.1);
        }
        a {
            color: #005792;
            text-decoration: none;
        }
        a:hover {
            text-decoration: underline;
        }
    &lt;/style&gt;
&lt;/head&gt;
&lt;body&gt;
    &lt;div class=&quot;container&quot;&gt;
        &lt;h1&gt;Quick Tip: See SAS Dataset Labels Without Opening the Data&lt;/h1&gt;
        &lt;p&gt;When working with SAS datasets, checking the &lt;strong&gt;dataset label&lt;/strong&gt; without actually opening the data can be very useful. Whether you&#39;re debugging or documenting, this small trick can save you time and effort!&lt;/p&gt;

        &lt;h2&gt;1. Use &lt;code&gt;PROC CONTENTS&lt;/code&gt;&lt;/h2&gt;
        &lt;p&gt;&lt;code&gt;PROC CONTENTS&lt;/code&gt; is the most common and straightforward way to view the dataset label.&lt;/p&gt;
        &lt;pre&gt;&lt;code&gt;proc contents data=yourlib.yourdataset;
run;&lt;/code&gt;&lt;/pre&gt;
        &lt;p&gt;The dataset label will appear in the output as the field: &lt;strong&gt;Data Set Label&lt;/strong&gt;.&lt;/p&gt;

        &lt;h2&gt;2. Query &lt;code&gt;DICTIONARY.TABLES&lt;/code&gt; or &lt;code&gt;SASHELP.VTABLE&lt;/code&gt;&lt;/h2&gt;
        &lt;p&gt;For a programmatic approach, use the &lt;strong&gt;DICTIONARY.TABLES&lt;/strong&gt; table or &lt;strong&gt;SASHELP.VTABLE&lt;/strong&gt; view to query dataset metadata.&lt;/p&gt;
        &lt;h3&gt;Example Using PROC SQL&lt;/h3&gt;
        &lt;pre&gt;&lt;code&gt;proc sql;
   select memname, memlabel 
   from dictionary.tables
   where libname=&#39;YOURLIB&#39; and memname=&#39;YOURDATASET&#39;;
quit;&lt;/code&gt;&lt;/pre&gt;

        &lt;h3&gt;Example Using SASHELP.VTABLE&lt;/h3&gt;
        &lt;pre&gt;&lt;code&gt;proc print data=sashelp.vtable;
   where libname=&#39;YOURLIB&#39; and memname=&#39;YOURDATASET&#39;;
   var memname memlabel;
run;&lt;/code&gt;&lt;/pre&gt;
        &lt;p&gt;Both methods will show the dataset&#39;s name and its label.&lt;/p&gt;

        &lt;h2&gt;3. Use &lt;code&gt;PROC DATASETS&lt;/code&gt;&lt;/h2&gt;
        &lt;p&gt;For advanced users, &lt;code&gt;PROC DATASETS&lt;/code&gt; can display dataset attributes, including labels:&lt;/p&gt;
        &lt;pre&gt;&lt;code&gt;proc datasets library=yourlib;
   contents data=yourdataset;
run;
quit;&lt;/code&gt;&lt;/pre&gt;

        &lt;h2&gt;Why Is This Helpful?&lt;/h2&gt;
        &lt;ul&gt;
            &lt;li&gt;Quickly check dataset metadata without loading or viewing the data.&lt;/li&gt;
            &lt;li&gt;Useful for documenting datasets during large projects.&lt;/li&gt;
            &lt;li&gt;Helpful in automation scripts for SAS programming.&lt;/li&gt;
        &lt;/ul&gt;

        &lt;h2&gt;Conclusion&lt;/h2&gt;
        &lt;p&gt;Using these methods, you can easily view SAS dataset labels without opening the data. Whether you prefer &lt;code&gt;PROC CONTENTS&lt;/code&gt;, querying metadata tables, or &lt;code&gt;PROC DATASETS&lt;/code&gt;, the choice depends on your workflow.&lt;/p&gt;
        &lt;p&gt;Happy coding! If you found this tip useful, don&amp;rsquo;t forget to share it with your fellow SAS programmers.&lt;/p&gt;

        &lt;p&gt;Looking for more SAS programming tricks? Stay tuned for more posts on &lt;a href=&quot;#&quot;&gt;Rupee Stories&lt;/a&gt;!&lt;/p&gt;
    &lt;/div&gt;
&lt;/body&gt;
&lt;/html&gt;
</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2315822260943695633/posts/default/4942087196830674850'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2315822260943695633/posts/default/4942087196830674850'/><link rel='alternate' type='text/html' href='http://studysas.blogspot.com/2024/12/learn-how-to-view-sas-dataset-labels.html' title='Learn how to view SAS dataset labels without opening the dataset directly in a SAS session. Easy methods and examples included!'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-2315822260943695633.post-7032574068630648417</id><published>2024-12-12T15:05:00.004-05:00</published><updated>2025-01-07T11:37:50.206-05:00</updated><title type='text'>Leveraging a SAS Macro to Generate a Report on Non-Printable Characters in SDTM Datasets</title><content type='html'>&lt;!DOCTYPE html&gt;
&lt;html lang=&quot;en&quot;&gt;
&lt;head&gt;
    &lt;meta charset=&quot;UTF-8&quot;&gt;
    &lt;meta name=&quot;viewport&quot; content=&quot;width=device-width, initial-scale=1.0&quot;&gt;
    &lt;title&gt;&lt;/title&gt;
    &lt;style&gt;
        body {
            font-family: Arial, sans-serif;
            line-height: 1.6;
            margin: 0;
            padding: 20px;
            background-color: #f4f4f9;
            color: #333;
        }
        h1, h2, h3 {
            color: #0056b3;
        }
        code {
            background: #e8e8e8;
            padding: 2px 4px;
            border-radius: 3px;
            font-family: &quot;Courier New&quot;, Courier, monospace;
        }
        pre {
            background: #f9f9f9;
            padding: 10px;
            border: 1px solid #ddd;
            border-radius: 5px;
            overflow-x: auto;
        }
        .container {
            max-width: 800px;
            margin: auto;
            background: #fff;
            padding: 20px;
            border-radius: 5px;
            box-shadow: 0 2px 5px rgba(0, 0, 0, 0.1);
        }
        .highlight {
            background-color: #fffae6;
            padding: 5px;
            border-left: 4px solid #ffc107;
        }
    &lt;/style&gt;
&lt;/head&gt;
&lt;body&gt;
    &lt;div class=&quot;container&quot;&gt;
        &lt;h1&gt;Detecting Non-Printable Characters in SDTM Datasets Using SAS&lt;/h1&gt;
        &lt;p&gt;Non-printable characters in datasets can lead to errors and inconsistencies, especially in the highly regulated environment of clinical trials. This blog post demonstrates how to create a SAS program that identifies non-printable characters in all SDTM datasets within a library and generates a comprehensive report.&lt;/p&gt;
        
        &lt;h2&gt;Why Detect Non-Printable Characters?&lt;/h2&gt;
        &lt;p&gt;Non-printable characters, such as ASCII values below 32 or above 126, can cause issues during data validation, regulatory submissions, and downstream processing. Detecting them early ensures the quality and compliance of your SDTM datasets.&lt;/p&gt;
        
        &lt;h2&gt;The SAS Program&lt;/h2&gt;
        &lt;p&gt;The following SAS program processes all SDTM datasets in a library and generates a combined report of non-printable characters, including:&lt;/p&gt;
        &lt;ul&gt;
            &lt;li&gt;&lt;strong&gt;Dataset name&lt;/strong&gt;: The dataset where the issue exists.&lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;Variable name&lt;/strong&gt;: The variable containing non-printable characters.&lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;Row number&lt;/strong&gt;: The row where the non-printable character is found.&lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;Positions&lt;/strong&gt;: The exact position(s) of non-printable characters with their ASCII values.&lt;/li&gt;
        &lt;/ul&gt;

        &lt;h3&gt;Program Code&lt;/h3&gt;
        &lt;pre&gt;
%macro find_nonprintable_all(libname=, output=);

    /* Keep working macro variables local so callers are not clobbered */
    %local dsetlist count i dset;
    %let dsetlist =;

    /* Get a list of all datasets in the library */
    proc sql noprint;
        select memname into :dsetlist separated by &#39; &#39;
        from dictionary.tables
        where libname = upcase(&quot;&amp;libname&quot;);
    quit;

    /* Create an empty shell for the combined report */
    data &amp;output.;
        length _dataset $32 _varname $32 _rownum 8 _position_list $500;
        call missing(_dataset, _varname, _rownum, _position_list);
        stop;
    run;

    /* Exit gracefully (empty report) when the library has no datasets */
    %if %length(&amp;dsetlist) = 0 %then %do;
        %put WARNING: No datasets found in library %upcase(&amp;libname).;
        %return;
    %end;

    /* Loop through each dataset and find non-printable characters */
    %let count = %sysfunc(countw(&amp;dsetlist));
    %do i = 1 %to &amp;count;
        %let dset = %scan(&amp;dsetlist, &amp;i);

        data temp_report;
            set &amp;libname..&amp;dset.;

            /* Define the array BEFORE declaring the report columns so that
               _character_ resolves to the source dataset&#39;s variables only */
            array charvars {*} _character_;

            length _dataset $32 _varname $32 _rownum 8 _position_list $500;
            _dataset = &quot;&amp;dset&quot;;

            /* Double-underscore names reduce the risk of colliding with
               variables that already exist in the source dataset */
            do __i = 1 to dim(charvars);
                _varname = vname(charvars[__i]);
                _rownum = _n_;
                _position_list = &#39;&#39;;

                /* Check each character; LENGTHN returns 0 for blank values,
                   so entirely blank variables are skipped */
                do __j = 1 to lengthn(charvars[__i]);
                    __ascii = rank(substr(charvars[__i], __j, 1));

                    /* Flag non-printable characters (outside ASCII 32-126).
                       PUT + STRIP avoids implicit numeric-to-character
                       conversion notes and padded output */
                    if __ascii &lt; 32 or __ascii &gt; 126 then
                        _position_list = catx(&#39;, &#39;, _position_list,
                            &#39;Position=&#39; || strip(put(__j, best.)) ||
                            &#39; (ASCII=&#39; || strip(put(__ascii, best.)) || &#39;)&#39;);
                end;

                /* Output one row per variable that has any findings */
                if not missing(_position_list) then output;
            end;

            /* Keep only the report columns so PROC APPEND does not warn
               about (and drop) unmatched source variables */
            keep _dataset _varname _rownum _position_list;
        run;

        /* Append to the combined report */
        proc append base=&amp;output. data=temp_report force;
        run;

        /* Clean up temporary dataset */
        proc datasets lib=work nolist;
            delete temp_report;
        quit;
    %end;

%mend find_nonprintable_all;

/* Example usage */
%find_nonprintable_all(libname=sdtm, output=nonprintable_combined_report);

/* Review the combined report */
proc print data=nonprintable_combined_report noobs;
    title &quot;Non-Printable Characters Report for All Datasets in the Library&quot;;
run;
        &lt;/pre&gt;

        &lt;h3&gt;How It Works&lt;/h3&gt;
        &lt;div class=&quot;highlight&quot;&gt;
            &lt;p&gt;The program processes each dataset in the specified library, examines all character variables for non-printable characters, and records their positions in a combined report.&lt;/p&gt;
        &lt;/div&gt;

        &lt;h3&gt;Output&lt;/h3&gt;
        &lt;p&gt;The final report contains the following columns:&lt;/p&gt;
        &lt;ul&gt;
            &lt;li&gt;&lt;code&gt;_dataset&lt;/code&gt;: Name of the dataset.&lt;/li&gt;
            &lt;li&gt;&lt;code&gt;_varname&lt;/code&gt;: Name of the variable.&lt;/li&gt;
            &lt;li&gt;&lt;code&gt;_rownum&lt;/code&gt;: Row number.&lt;/li&gt;
            &lt;li&gt;&lt;code&gt;_position_list&lt;/code&gt;: Details of non-printable character positions and ASCII values.&lt;/li&gt;
        &lt;/ul&gt;

        &lt;h2&gt;Conclusion&lt;/h2&gt;
        &lt;p&gt;Using this SAS program, you can proactively identify and address non-printable characters in SDTM datasets, ensuring data integrity and compliance. Feel free to adapt this program for your specific needs.&lt;/p&gt;

    &lt;/div&gt;
&lt;/body&gt;
&lt;/html&gt;
</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2315822260943695633/posts/default/7032574068630648417'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2315822260943695633/posts/default/7032574068630648417'/><link rel='alternate' type='text/html' href='http://studysas.blogspot.com/2024/12/detecting-non-printable-characters-in.html' title='Leveraging a SAS Macro to Generate a Report on Non-Printable Characters in SDTM Datasets'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-2315822260943695633.post-162934683031366951</id><published>2024-12-06T05:47:00.001-05:00</published><updated>2024-12-06T05:47:47.502-05:00</updated><title type='text'>SDTM aCRF Annotation Checklist</title><content type='html'>&lt;!DOCTYPE html&gt;
&lt;html lang=&quot;en&quot;&gt;
&lt;head&gt;
    &lt;meta charset=&quot;UTF-8&quot;&gt;
    &lt;meta name=&quot;viewport&quot; content=&quot;width=device-width, initial-scale=1.0&quot;&gt;
    &lt;meta name=&quot;description&quot; content=&quot;A comprehensive checklist for annotating SDTM aCRF to meet compliance and regulatory submission requirements.&quot;&gt;
    &lt;meta name=&quot;keywords&quot; content=&quot;SDTM, aCRF, Annotation, Clinical Trials, Pinnacle 21, CDISC, Regulatory Compliance&quot;&gt;
    &lt;meta name=&quot;author&quot; content=&quot;Sarath Annapareddy&quot;&gt;
    &lt;title&gt;SDTM aCRF Annotation Checklist&lt;/title&gt;
    &lt;style&gt;
        body {
            font-family: Arial, sans-serif;
            line-height: 1.6;
            margin: 0;
            padding: 0;
            background-color: #f9f9f9;
            color: #333;
        }
        header {
            background-color: #4CAF50;
            color: white;
            padding: 1em 0;
            text-align: center;
        }
        main {
            padding: 20px;
        }
        section {
            margin-bottom: 20px;
        }
        h1, h2 {
            color: #4CAF50;
        }
        ul {
            margin: 10px 0;
            padding: 0 20px;
        }
        footer {
            text-align: center;
            background-color: #333;
            color: white;
            padding: 1em 0;
        }
    &lt;/style&gt;
&lt;/head&gt;
&lt;body&gt;
    &lt;header&gt;
        &lt;h1&gt;SDTM aCRF Annotation Checklist&lt;/h1&gt;
        &lt;p&gt;&lt;em&gt;By Sarath Annapareddy&lt;/em&gt;&lt;/p&gt;
    &lt;/header&gt;

    &lt;main&gt;
        &lt;section&gt;
            &lt;h2&gt;Introduction&lt;/h2&gt;
            &lt;p&gt;
                Creating an SDTM Annotated Case Report Form (aCRF) is a critical step in clinical trial data submission. 
                It ensures that data collected in the CRF maps correctly to SDTM domains, adhering to regulatory and CDISC standards. 
                This checklist serves as a guide to creating a high-quality SDTM aCRF ready for regulatory submission.
            &lt;/p&gt;
        &lt;/section&gt;

        &lt;section&gt;
            &lt;h2&gt;1. General Formatting&lt;/h2&gt;
            &lt;ul&gt;
                &lt;li&gt;Ensure the aCRF uses the latest SDTM IG version relevant to the study.&lt;/li&gt;
                &lt;li&gt;The document should be clean, legible, and free of overlapping annotations.&lt;/li&gt;
                &lt;li&gt;Page numbers in the aCRF should align with the actual CRF pages.&lt;/li&gt;
                &lt;li&gt;Annotations must be in English, clear, and consistently formatted.&lt;/li&gt;
                &lt;li&gt;Use color coding to differentiate domain mappings, derived variables, and special-purpose annotations.&lt;/li&gt;
            &lt;/ul&gt;
        &lt;/section&gt;

        &lt;section&gt;
            &lt;h2&gt;2. Domain-Level Annotations&lt;/h2&gt;
            &lt;ul&gt;
                &lt;li&gt;Annotate each field on the CRF with the corresponding SDTM variable name (e.g., &lt;code&gt;AGE&lt;/code&gt; in DM, &lt;code&gt;LBTEST&lt;/code&gt; in LB); note that DM demographic variables carry no domain prefix.&lt;/li&gt;
                &lt;li&gt;Ensure every field includes an appropriate domain prefix (e.g., &lt;code&gt;DM&lt;/code&gt;, &lt;code&gt;AE&lt;/code&gt;).&lt;/li&gt;
                &lt;li&gt;Unmapped fields should be labeled with &quot;Not Mapped&quot; or &quot;NM&quot;.&lt;/li&gt;
                &lt;li&gt;Ensure proper usage of variable cases (e.g., all uppercase for SDTM variable names).&lt;/li&gt;
            &lt;/ul&gt;
        &lt;/section&gt;

        &lt;section&gt;
            &lt;h2&gt;3. Data Collection Fields&lt;/h2&gt;
            &lt;ul&gt;
                &lt;li&gt;Map demographic fields (e.g., &lt;code&gt;SEX&lt;/code&gt;, &lt;code&gt;RACE&lt;/code&gt;) to the &lt;code&gt;DM&lt;/code&gt; domain.&lt;/li&gt;
                &lt;li&gt;Adverse event fields (e.g., event name, severity) should map to the &lt;code&gt;AE&lt;/code&gt; domain.&lt;/li&gt;
                &lt;li&gt;Laboratory test results and units should map to the &lt;code&gt;LB&lt;/code&gt; domain.&lt;/li&gt;
                &lt;li&gt;Exposure data (e.g., drug start/stop dates) must align with the &lt;code&gt;EX&lt;/code&gt; domain.&lt;/li&gt;
                &lt;li&gt;Use the &lt;code&gt;DV&lt;/code&gt; domain for protocol deviations and the &lt;code&gt;MH&lt;/code&gt; domain for medical history.&lt;/li&gt;
            &lt;/ul&gt;
        &lt;/section&gt;

        &lt;section&gt;
            &lt;h2&gt;4. Special-Purpose Domains&lt;/h2&gt;
            &lt;ul&gt;
                &lt;li&gt;SUPPQUAL should be used for non-standard variables.&lt;/li&gt;
                &lt;li&gt;RELREC annotations are required for defining relationships between domains.&lt;/li&gt;
                &lt;li&gt;Free-text comments should map to the &lt;code&gt;CO&lt;/code&gt; domain.&lt;/li&gt;
                &lt;li&gt;Trial design fields should map to domains like &lt;code&gt;TA&lt;/code&gt;, &lt;code&gt;TV&lt;/code&gt;, and &lt;code&gt;TS&lt;/code&gt;.&lt;/li&gt;
            &lt;/ul&gt;
        &lt;/section&gt;

        &lt;section&gt;
            &lt;h2&gt;5. Derived and Computed Variables&lt;/h2&gt;
            &lt;ul&gt;
                &lt;li&gt;Derived variables must be clearly labeled (e.g., &quot;DERIVED&quot; or &quot;CALCULATED&quot;).&lt;/li&gt;
                &lt;li&gt;Ensure annotations for variables like BMI reference all contributing fields (e.g., height and weight).&lt;/li&gt;
                &lt;li&gt;Visit variables (e.g., &lt;code&gt;VISITNUM&lt;/code&gt;, &lt;code&gt;VISITDY&lt;/code&gt;) should align with &lt;code&gt;RFSTDTC&lt;/code&gt;.&lt;/li&gt;
            &lt;/ul&gt;
        &lt;/section&gt;

        &lt;section&gt;
            &lt;h2&gt;6. Date and Time Variables&lt;/h2&gt;
            &lt;ul&gt;
                &lt;li&gt;All date fields must follow ISO 8601 format (e.g., &lt;code&gt;AESTDTC&lt;/code&gt;, &lt;code&gt;EXSTDTC&lt;/code&gt;).&lt;/li&gt;
                &lt;li&gt;Derived date variables like &lt;code&gt;VISITDY&lt;/code&gt; should be calculated relative to &lt;code&gt;RFSTDTC&lt;/code&gt;.&lt;/li&gt;
            &lt;/ul&gt;
        &lt;/section&gt;

        &lt;section&gt;
            &lt;h2&gt;7. Validation and Quality Control&lt;/h2&gt;
            &lt;ul&gt;
                &lt;li&gt;Validate the aCRF against the finalized SDTM datasets.&lt;/li&gt;
                &lt;li&gt;Ensure alignment with the Define.xml document.&lt;/li&gt;
                &lt;li&gt;Conduct reviews by the programming and data management teams.&lt;/li&gt;
                &lt;li&gt;Perform a completeness check to ensure no fields are left unannotated.&lt;/li&gt;
            &lt;/ul&gt;
        &lt;/section&gt;

        &lt;section&gt;
            &lt;h2&gt;8. Regulatory Submission Readiness&lt;/h2&gt;
            &lt;ul&gt;
                &lt;li&gt;Ensure compliance with the requirements of regulatory authorities (e.g., FDA, PMDA).&lt;/li&gt;
                &lt;li&gt;Submit the aCRF in a searchable, bookmarked PDF format.&lt;/li&gt;
                &lt;li&gt;Verify that all color-coded annotations are visible in grayscale for printed versions.&lt;/li&gt;
                &lt;li&gt;Include a cover page with the study title, protocol number, and version.&lt;/li&gt;
            &lt;/ul&gt;
        &lt;/section&gt;

        &lt;section&gt;
            &lt;h2&gt;Conclusion&lt;/h2&gt;
            &lt;p&gt;
                A well-annotated SDTM aCRF is crucial for successful regulatory submissions. By following this checklist, 
                you can ensure your aCRF meets compliance requirements and demonstrates traceability between the CRF, datasets, 
                and Define.xml. This meticulous process not only ensures regulatory approval but also enhances the credibility 
                of your clinical trial data.
            &lt;/p&gt;
        &lt;/section&gt;
    &lt;/main&gt;

    &lt;footer&gt;
        &lt;p&gt;© 2024 Rupee Stories. All rights reserved.&lt;/p&gt;
    &lt;/footer&gt;
&lt;/body&gt;
&lt;/html&gt;
</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2315822260943695633/posts/default/162934683031366951'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2315822260943695633/posts/default/162934683031366951'/><link rel='alternate' type='text/html' href='http://studysas.blogspot.com/2024/12/sdtm-acrf-annotation-checklist.html' title='SDTM aCRF Annotation Checklist'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author></entry></feed>