Replace GTFS spec scraper with hand-authored TypeScript library (+ implementation status doc) #4
Reference
gtfs.zone/coloring-book#4
Summary
Replace the fragile two-step auto-scraping pipeline (`scrape-gtfs-spec.ts` → `gtfs-spec.json` → `generate-gtfs-types.ts` → `gtfs.ts`) with a hand-authored TypeScript spec library. The spec lives in `src/gtfs-spec/files/`, with one `.ts` file per GTFS file, and is authored verbatim from the official reference at https://gtfs.org/documentation/schedule/reference/. A runtime adapter replaces the codegen step by deriving all existing app exports (`GTFS_PRIMARY_KEYS`, `GTFS_FIELD_TYPES`, `GTFSSchemas`, `GTFS_ENUMS`, etc.) from the spec data at startup. The spec is structured to eventually be extracted as a standalone zero-dependency npm package. A `docs/gtfs-implementation-status.md` file tracks which files/features are supported in the UI.

Why manual? The scraper silently mis-parses edge cases: for example, the `continuous_pickup` value `1 or empty` was scraped as just `1`, losing the "or empty" semantic. The reference changes rarely (a few times a year at most), so manual authorship with phased review is more accurate and maintainable than fixing an HTML parser.

Relevant Context
Current pipeline (being replaced)
- `scripts/scrape-gtfs-spec.ts`: scrapes gtfs.org HTML → `src/gtfs-spec.json` ← deleted in Phase 1
- `scripts/generate-gtfs-types.ts`: reads JSON → generates `src/types/gtfs.ts` + `src/types/gtfs-enums.ts` ← deleted in Phase 1
- `src/gtfs-spec.json`: scraped intermediate; kept as read-only reference until Phase 15, then deleted
- `src/types/gtfs.ts`: 1700-line generated file; stays until Phase 15 (app depends on it)
- `src/types/gtfs-enums.ts`: generated enum registry; stays until Phase 15

Files that stay (hand-written, referenced by adapter)
- `src/types/gtfs-field-types.ts`: `GTFSFieldType` enum + `GTFS_FIELD_TYPE_METADATA` with `zodValidator` per type. The adapter uses these to build Zod schemas programmatically.
- `src/types/gtfs-entities.ts`: `z.infer<>` types; updated in Phase 15 to point at new schemas

App consumers (updated in Phase 15)
- `src/utils/field-component.ts:381`: `generateFieldConfigsFromSchema()` uses `GTFSSchemas`, `GTFS_FIELD_TYPES`, `GTFS_PRIMARY_KEYS`, `GTFS_ENUMS`
- `src/utils/zod-tooltip-helper.ts:137`: `getGTFSFieldDescription()` uses `GTFSSchemas`
- `src/modules/gtfs-database.ts`: uses `GTFS_FILES`, primary key mappings

New file structure
Key type decisions
- `GTFSPresence = 'Required' | 'Optional' | 'Conditionally Required' | 'Conditionally Forbidden' | 'Recommended'`: `Recommended` is first-class in the type, but the adapter treats it as Optional for Zod purposes (no enforcement)
- `presenceCondition?: string`: prose condition copied verbatim from reference (e.g. "Required if agency.txt has multiple agencies")
- `enumValues?: Array<{ value: number | string; label: string; description: string }>`: each part stored separately, enabling richer UI rendering
- `allowEmpty?: boolean`: for fields like `arrival_time`/`departure_time` that can be blank under conditions (distinct from Optional, which means the column may be omitted entirely). Adapter generates `z.union([z.literal(''), baseValidator]).describe(description).optional()` for these.
- `foreignKey?: { file: string; field: string }`: parsed from "Foreign ID referencing X.Y" patterns
- `format?: 'csv' | 'geojson'` on `GTFSFileSpec`: for `locations.geojson`
- `specVersion: string` on `GTFSSpec`: records the reference date this was authored against

Phase 1 - Foundation: types, file structure, delete scraper/codegen
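As a rough sketch, the type decisions listed under "Key type decisions" could translate into `src/gtfs-spec/types.ts` along the following lines. The type names and most property names come from this description; the `name` property on fields, the `files` array on `GTFSSpec`, and all sample values are assumptions made for illustration.

```typescript
// Sketch of src/gtfs-spec/types.ts. `name` and `files` are assumed
// property names; everything else mirrors the plan above.
type GTFSPresence =
  | 'Required'
  | 'Optional'
  | 'Conditionally Required'
  | 'Conditionally Forbidden'
  | 'Recommended';

interface GTFSEnumValue {
  value: number | string;
  label: string;
  description: string;
}

interface GTFSFieldSpec {
  name: string; // assumed property name
  type: string; // e.g. 'Time', 'Unique ID'
  presence: GTFSPresence;
  presenceCondition?: string; // prose copied from the reference
  description: string;
  isPrimaryKey?: boolean;
  enumValues?: GTFSEnumValue[];
  allowEmpty?: boolean; // value may be blank even when the column exists
  foreignKey?: { file: string; field: string };
}

interface GTFSFileSpec {
  filename: string;
  format?: 'csv' | 'geojson';
  presence: GTFSPresence;
  presenceCondition?: string;
  description: string;
  fields?: GTFSFieldSpec[]; // left undefined for locations.geojson
}

interface GTFSSpec {
  specVersion: string; // reference date the spec was authored against
  files: GTFSFileSpec[]; // assumed container shape
}

// Placeholder text stands in for prose copied verbatim from the reference.
const PLACEHOLDER = '(copied verbatim from the reference)';

const arrivalTime: GTFSFieldSpec = {
  name: 'arrival_time',
  type: 'Time',
  presence: 'Conditionally Required',
  presenceCondition: PLACEHOLDER,
  description: PLACEHOLDER,
  allowEmpty: true,
};

const exampleSpec: GTFSSpec = {
  specVersion: '(reference date)',
  files: [
    {
      filename: 'stop_times.txt',
      presence: 'Required',
      description: PLACEHOLDER,
      fields: [arrivalTime],
    },
  ],
};
```

Keeping `allowEmpty` separate from `presence` lets a field be both `Conditionally Required` and legitimately blank in some rows, which is the distinction the plan draws for `arrival_time`.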
Delete the generation scripts immediately (clean break). Define the spec TypeScript types. The existing `gtfs.ts`/`gtfs-enums.ts` remain untouched; the app still depends on them through Phase 14.
- Create `src/gtfs-spec/types.ts` with: `GTFSPresence`, `GTFSEnumValue`, `GTFSFieldSpec`, `GTFSFileSpec`, `GTFSSpec`
- Create `src/gtfs-spec/index.ts` stub (empty `gtfsSpec` export for now; files added incrementally)
- Delete `scripts/scrape-gtfs-spec.ts`
- Delete `scripts/generate-gtfs-types.ts`
- Remove the `generate-gtfs-types` script from `package.json`
- Run `npm run typecheck` to confirm nothing is broken

Gotchas: `src/gtfs-spec.json` is kept as read-only reference until Phase 15. Do not add any imports of the new spec into the app yet; that happens in Phase 15.

Phase 2 - agency.txt + feed_info.txt
Two simple files to establish the pattern and validate the type structure before tackling complex files.
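To make the pattern concrete, here is a minimal sketch of what a per-file spec module such as `src/gtfs-spec/files/agency.ts` might look like. The field names, types, and presence values follow the checklist in this phase; the local type declarations, the `name` property, and the placeholder strings are assumptions, and a real file would import its types and carry the reference text verbatim.

```typescript
// Local copies of the spec types keep this sketch self-contained;
// the real module would import them from the shared types file.
type GTFSPresence =
  | 'Required'
  | 'Optional'
  | 'Conditionally Required'
  | 'Conditionally Forbidden'
  | 'Recommended';

interface GTFSFieldSpec {
  name: string; // assumed property name
  type: string;
  presence: GTFSPresence;
  presenceCondition?: string;
  description: string;
  isPrimaryKey?: boolean;
}

interface GTFSFileSpec {
  filename: string;
  presence: GTFSPresence;
  description: string;
  fields: GTFSFieldSpec[];
}

// Stand-in for prose that would be copied verbatim from the reference.
const REF = '(copied verbatim from the reference)';

const agency: GTFSFileSpec = {
  filename: 'agency.txt',
  presence: 'Required',
  description: REF,
  fields: [
    {
      name: 'agency_id',
      type: 'Unique ID',
      presence: 'Conditionally Required',
      presenceCondition: REF,
      description: REF,
      isPrimaryKey: true,
    },
    { name: 'agency_name', type: 'Text', presence: 'Required', description: REF },
    { name: 'agency_url', type: 'URL', presence: 'Required', description: REF },
    { name: 'agency_timezone', type: 'Timezone', presence: 'Required', description: REF },
    { name: 'agency_lang', type: 'Language code', presence: 'Optional', description: REF },
    { name: 'agency_phone', type: 'Phone number', presence: 'Optional', description: REF },
    { name: 'agency_fare_url', type: 'URL', presence: 'Optional', description: REF },
    { name: 'agency_email', type: 'Email', presence: 'Optional', description: REF },
  ],
};

// Quick self-check: how many fields are flagged as the primary key.
const primaryKeyCount = agency.fields.filter((f) => f.isPrimaryKey === true).length;
```

Because each file is plain data, a reviewer can compare it field by field against the reference page without running anything.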
- Create `src/gtfs-spec/files/agency.ts` (8 fields): `agency_id` (Unique ID, Conditionally Required, primary key), `agency_name` (Text, Required), `agency_url` (URL, Required), `agency_timezone` (Timezone, Required), `agency_lang` (Language code, Optional), `agency_phone` (Phone number, Optional), `agency_fare_url` (URL, Optional), `agency_email` (Email, Optional)
- Create `src/gtfs-spec/files/feed-info.ts` (9 fields): `feed_publisher_name`, `feed_publisher_url`, `feed_lang`, `default_lang`, `feed_start_date`, `feed_end_date`, `feed_version`, `feed_contact_email`, `feed_contact_url`
- Update `src/gtfs-spec/index.ts`

Gotchas: `agency_id` has `presenceCondition` from reference. `feed_info.txt` itself is Conditionally Required, captured at file level with `presenceCondition`. `default_lang` is Optional per reference. `feed_start_date` and `feed_end_date` are Recommended (not Required).

Phase 3 - stops.txt
Complex file with enums, foreign keys, and nuanced conditional presence, so it gets its own phase. Implemented 16 fields total (verified against live reference). Added `stop_access` (Conditionally Forbidden, enum 0–1), a newer field not present in the old scraper. Verified that the `location_type` value 0 "or empty" equivalence is captured in the enum description. `wheelchair_boarding` enum descriptions cover all three location contexts (parentless stops, child stops, station entrances). `zone_id` kept as Optional (per current reference) but with a `presenceCondition` noting the fare_rules.txt dependency.
- Create `src/gtfs-spec/files/stops.ts` (~20 fields)
  - `location_type` enum: 5 values (0–4), each with label and full description. Value `0` has `label: 'Stop (or Platform)'`; verify the "or empty" semantic is captured in the description
  - `parent_station`: foreign key `{ file: 'stops.txt', field: 'stop_id' }`, Conditionally Required
  - `wheelchair_boarding` enum: 3 values (0–2); semantics differ by `location_type`, so copy the description verbatim
  - `stop_name`, `stop_lat`, `stop_lon`, `parent_station`, `stop_timezone`, `zone_id`: all have `presenceCondition` prose
- Update `src/gtfs-spec/index.ts`

Gotchas: Do NOT rely on `gtfs-spec.json` as source of truth for this file. Verify every enum value against the live reference, as the scraper was known to mis-parse this file.

Phase 4 - routes.txt + trips.txt
Two medium-complexity files. `routes.txt` has the important `continuous_pickup`/`continuous_drop_off` enum bug fix. Implemented all fields verbatim from reference. The `continuous_pickup`/`continuous_drop_off` value `1` label is `'No continuous stopping pickup'`/`'No continuous stopping drop off'`, with "An empty value is equivalent to 1." in the description. `network_id` is `Conditionally Forbidden` with a mutual exclusivity note. `trips.txt` includes `shape_id` as `Conditionally Required` with a foreign key to `shapes.txt`.
- Create `src/gtfs-spec/files/routes.ts` (~15 fields)
  - `route_type` enum: values 0–7, 11, 12 (check reference for any extended route type notes)
  - `continuous_pickup`/`continuous_drop_off` enums: 4 values; value `1` label must be `'No continuous stopping pickup'` (the reference says "1 or empty"); capture the "or empty" equivalence in the description verbatim
  - `network_id`: note mutual exclusivity with `networks.txt`/`route_networks.txt` in description
- Create `src/gtfs-spec/files/trips.ts` (~10 fields)
  - `direction_id` enum (0–1), `bikes_allowed` enum (0–2), `wheelchair_accessible` enum (0–2)
  - `shape_id`: foreign key `{ file: 'shapes.txt', field: 'shape_id' }`
- Update `src/gtfs-spec/index.ts`

Phase 5 - stop_times.txt
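Both `routes.txt` (Phase 4) and `stop_times.txt` (this phase) carry the `continuous_pickup` enum with its "1 or empty" quirk. The sketch below shows one way the enum entries might be stored, plus a hypothetical app-layer helper, `normalizeContinuousPickup`, that treats a blank cell as `1`. The helper is not part of this PR, the labels for values 0, 2, and 3 are paraphrased and should be checked against the reference, and the empty descriptions are placeholders.

```typescript
interface GTFSEnumValue {
  value: number | string;
  label: string;
  description: string;
}

// Value 1 keeps the "empty is equivalent" note in its description, as the
// plan requires. Other labels are paraphrased; descriptions are placeholders.
const continuousPickup: GTFSEnumValue[] = [
  { value: 0, label: 'Continuous stopping pickup', description: '' },
  {
    value: 1,
    label: 'No continuous stopping pickup',
    description: 'An empty value is equivalent to 1.',
  },
  { value: 2, label: 'Must phone agency to arrange continuous stopping pickup', description: '' },
  {
    value: 3,
    label: 'Must coordinate with driver to arrange continuous stopping pickup',
    description: '',
  },
];

// Hypothetical app-layer helper: maps a raw CSV cell to an enum value,
// treating a blank cell as 1 and rejecting anything outside the enum.
function normalizeContinuousPickup(raw: string): number {
  const trimmed = raw.trim();
  if (trimmed === '') {
    return 1;
  }
  const parsed = Number(trimmed);
  const known = continuousPickup.some((entry) => entry.value === parsed);
  if (!known) {
    throw new Error('Unknown continuous_pickup value: ' + raw);
  }
  return parsed;
}
```

Keeping the normalization outside the spec data matches the note later in this description that the empty-to-1 mapping is an app-layer concern.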
The most conditional file in the spec. Primary use case for `allowEmpty`. Implemented 18 fields total. `stop_id`/`location_group_id`/`location_id` mutual exclusivity is captured in both `presenceCondition` and `description` for each. `continuous_pickup`/`continuous_drop_off` are `Conditionally Forbidden` (forbidden when pickup/drop-off windows are defined) rather than Optional as in routes.txt. `pickup_booking_rule_id` and `drop_off_booking_rule_id` added for flex transit booking rules.
- Create `src/gtfs-spec/files/stop-times.ts` (~15 fields)
  - `arrival_time`: type `'Time'`, `allowEmpty: true`, presence `'Conditionally Required'` with prose condition
  - `departure_time`: same pattern as `arrival_time`
  - `stop_id`/`location_id`/`location_group_id`: mutually exclusive; capture in each field's description
  - `pickup_type`/`drop_off_type` enums: 4 values each
  - `continuous_pickup`/`continuous_drop_off` enums: same 4 values as `routes.txt` (same "or empty" issue on value 1)
  - `timepoint` enum: 0–1
- Update `src/gtfs-spec/index.ts`

Phase 6 - calendar.txt + calendar_dates.txt
Service schedule files. Simple structure, but each file is conditionally required with respect to the other. Both files authored verbatim from reference. `calendar.txt` is Conditionally Required (unless all dates are defined in `calendar_dates.txt`). `calendar_dates.txt` is Conditionally Required (if `calendar.txt` is omitted). The `service_id` in `calendar_dates.txt` uses a `foreignKey` to `calendar.txt`; this captures the relationship when both files are used together.
- Create `src/gtfs-spec/files/calendar.ts` (10 fields): `service_id` (Unique ID, Required, primary key), `monday`–`sunday` (each Enum 0/1, Required), `start_date` (Date, Required), `end_date` (Date, Required)
- Create `src/gtfs-spec/files/calendar-dates.ts` (3 fields): `service_id` (ID, Required), `date` (Date, Required), `exception_type` (Enum 1/2, Required)
- Set `GTFSFileSpec.presenceCondition` for both files
- Update `src/gtfs-spec/index.ts`

Phase 7 - shapes.txt + frequencies.txt + transfers.txt
Three files grouped by relationship to trips. `transfers.txt` has the most conditional logic. All three files authored verbatim from reference. `transfers.txt` captures the full `transfer_type` enum (0–5) including the newer in-seat transfer types (4 and 5), with `from_trip_id`/`to_trip_id` marked Conditionally Required for those values. `frequencies.txt` uses a foreign key to `trips.txt`.
- Create `src/gtfs-spec/files/shapes.ts` (5 fields): `shape_id` (ID, Required, primary key), `shape_pt_lat`, `shape_pt_lon`, `shape_pt_sequence`, `shape_dist_traveled`
- Create `src/gtfs-spec/files/frequencies.ts` (5 fields): `trip_id`, `start_time`, `end_time`, `headway_secs`, `exact_times` (Enum 0/1)
- Create `src/gtfs-spec/files/transfers.ts` (~9 fields): `transfer_type` enum (0–5), multiple foreign keys, several Conditionally Required fields
- Update `src/gtfs-spec/index.ts`

Phase 8 - pathways.txt + levels.txt
Station interior navigation files. `pathways.txt` is among the most complex in the spec. Implemented 12 fields for pathways.txt (`pathway_id`, `from_stop_id`, `to_stop_id`, `pathway_mode`, `is_bidirectional`, `length`, `traversal_time`, `stair_count`, `max_slope`, `min_width`, `signposted_as`, `reversed_signposted_as`). `levels.txt` is marked Conditionally Required since it's only needed when stops reference `level_id`.
- Create `src/gtfs-spec/files/pathways.ts` (~15 fields)
  - `pathway_mode` enum: 7 values (1–7)
  - `is_bidirectional` enum: 0–1
  - `pathway_mode`: capture all `presenceCondition` strings
- Create `src/gtfs-spec/files/levels.ts` (3 fields): `level_id` (Unique ID, Required, primary key), `level_index` (Float, Required), `level_name` (Text, Optional)
- Update `src/gtfs-spec/index.ts`

Phase 9 - fare_attributes.txt + fare_rules.txt
Legacy fare model (Fares v1). Still widely used. Fairly straightforward. The `transfers` enum in `fare_attributes.txt` includes an empty-string value for unlimited transfers, stored as `value: ''` with label `'Unlimited'`. `fare_rules.txt` is all foreign keys; `origin_id`, `destination_id`, and `contains_id` all reference `stops.txt`'s `zone_id` field.
- Create `src/gtfs-spec/files/fare-attributes.ts` (7 fields): `fare_id` (Unique ID, primary key), `price`, `currency_type`, `payment_method` (Enum 0/1), `transfers` (Enum 0/1/2/empty), `agency_id`, `transfer_duration`
- Create `src/gtfs-spec/files/fare-rules.ts` (5 fields): all foreign keys (`fare_id`, `route_id`, `origin_id`, `destination_id`, `contains_id`)
- Update `src/gtfs-spec/index.ts`

Gotchas: `transfers` in `fare_attributes.txt` has an "empty" option (unlimited transfers), noted in the description and stored as `value: ''`.

Phase 10 - Fares v2: fare_media.txt, fare_products.txt, fare_leg_rules.txt, fare_leg_join_rules.txt, fare_transfer_rules.txt
The newer Fares v2 model. `fare_leg_rules.txt` and `fare_transfer_rules.txt` are the most complex. These files are entirely separate from Fares v1, as noted in each file's description. `fare_products.txt` has a composite primary key (`fare_product_id` + `rider_category_id` + `fare_media_id`); `fare_product_id` is marked `isPrimaryKey: true` as the identifying field. `fare_transfer_rules.txt` has 8 fields including `duration_limit_type` (enum 0–3, measuring departure/arrival combinations) and `fare_transfer_type` (enum 0–2, representing A+AB, A+AB+B, and AB cost models). `duration_limit` and `duration_limit_type` are mutually Conditionally Required: each requires the other. `fare_leg_join_rules.txt` has 4 fields including an optional `fare_transfer_rule_id` back-reference. The `fare_leg_rules.txt` `network_id` foreign key points to `networks.txt` (which cross-references `routes.network_id` as well).
- Create `src/gtfs-spec/files/fare-media.ts` (~5 fields): `fare_media_id` (primary key), `fare_media_name`, `fare_media_type` (Enum 0–4)
- Create `src/gtfs-spec/files/fare-products.ts` (~5 fields): `fare_product_id` (primary key), `fare_product_name`, `fare_media_id`, `amount`, `currency`
- Create `src/gtfs-spec/files/fare-leg-rules.ts` (~10 fields with complex conditionality)
- Create `src/gtfs-spec/files/fare-leg-join-rules.ts` (small, ~4 fields)
- Create `src/gtfs-spec/files/fare-transfer-rules.ts` (~10 fields): `duration_limit_type` enum (0–3), `fare_transfer_type` enum (0–2)
- Update `src/gtfs-spec/index.ts`

Phase 11 - timeframes.txt, rider_categories.txt, areas.txt, stop_areas.txt, networks.txt, route_networks.txt
Six compact files, mostly 2–5 fields each. All authored verbatim from the reference. `timeframes.txt` has 4 fields including `start_time`/`end_time` (both Conditionally Required: each requires the other) and a `service_id` foreign key to `calendar.txt`. `rider_categories.txt` has 6 fields including an `is_default_fare_container` enum (0/1) and `min_age`/`max_age`. `areas.txt` and `stop_areas.txt` are straightforward 2-field files. `networks.txt` and `route_networks.txt` are both `Conditionally Forbidden`, with `presenceCondition` capturing the mutual exclusion with `routes.network_id`.
- Create `src/gtfs-spec/files/timeframes.ts`
- Create `src/gtfs-spec/files/rider-categories.ts`
- Create `src/gtfs-spec/files/areas.ts` (2 fields): `area_id` (primary key), `area_name`
- Create `src/gtfs-spec/files/stop-areas.ts`
- Create `src/gtfs-spec/files/networks.ts` (2 fields): `network_id` (primary key), `network_name`; file is Conditionally Forbidden, captured in `GTFSFileSpec.presenceCondition`
- Create `src/gtfs-spec/files/route-networks.ts`: Conditionally Forbidden, same mutual exclusion as `networks.txt`
- Update `src/gtfs-spec/index.ts`

Phase 12 - location_groups.txt, location_group_stops.txt, booking_rules.txt
Flexible transit (on-demand) files. `booking_rules.txt` has 15 fields (not ~20 as estimated; the reference has exactly 15). Many are Conditionally Required/Forbidden based on the `booking_type` value (0=real-time, 1=same-day advance, 2=prior day). `prior_notice_service_id` is a foreign key to `calendar.txt` for counting business days vs. calendar days. `location_group_stops.txt` has a composite primary key (`location_group_id` + `stop_id`).
- Create `src/gtfs-spec/files/location-groups.ts` (~3 fields): `location_group_id` (primary key), `location_group_name`
- Create `src/gtfs-spec/files/location-group-stops.ts` (2 fields): `location_group_id`, `stop_id`
- Create `src/gtfs-spec/files/booking-rules.ts` (15 fields): `booking_rule_id` (primary key), `booking_type` enum (0–2), many fields Conditionally Required/Forbidden based on the `booking_type` value
- Update `src/gtfs-spec/index.ts`

Phase 13 - translations.txt + attributions.txt
Final two standard CSV files. Both are compact administrative files.
- Create `src/gtfs-spec/files/translations.ts` (7 fields): `table_name` (Enum: list of GTFS table names), `field_name`, `language`, `translation`, `record_id`, `record_sub_id`, `field_value`
- Create `src/gtfs-spec/files/attributions.ts` (11 fields): `attribution_id` (primary key), `agency_id`, `route_id`, `trip_id`, `organization_name`, `is_producer`/`is_operator`/`is_authority` (each Enum 0/1), `attribution_url`, `attribution_email`, `attribution_phone`
- Update `src/gtfs-spec/index.ts`

Gotchas: `translations.txt`'s `table_name` field is an enum of GTFS filenames; copy verbatim from reference. `translation` and `field_value` have the compound type `'Text or URL or Email'`. No single `GTFSFieldType` covers this; the adapter falls back via `mapGTFSTypeString`, which normalizes it to `Text`/`z.string()`.

Phase 14 - Runtime adapter
Write `src/gtfs-spec/adapter.ts` to derive all existing app exports from the spec at runtime. This replaces the codegen step entirely. The output must be functionally identical to what `gtfs.ts` and `gtfs-enums.ts` currently export so that Phase 15 is a straightforward swap.

Discovered during implementation:
- In Zod v4 (used in this project), `describe()` sets the description as a direct property (`schema.description`) rather than in `_def.description`. The existing `zod-tooltip-helper.ts` already handles this correctly (checks `field.description` before `field._def.description`).
- For `allowEmpty` fields, the `.describe()` call must be placed on the union (`z.union([...]).describe(...).optional()`), not the inner validator, so the tooltip helper can find it after unwrapping `ZodOptional`.
- Two spec files were missing `isPrimaryKey: true`: `timeframes.txt` (`timeframe_group_id`) and `fare_leg_rules.txt` (`leg_group_id`); both fixed during this phase.
- The compound type `'Non-null integer'` (pathways `stair_count`) is not in `GTFSFieldType`; handled by falling back to `mapGTFSTypeString`, which maps it to `Integer`.

Checklist:
- Create `src/gtfs-spec/adapter.ts` exporting:
  - `deriveGTFSPrimaryKeys(spec)` → `Record<string, string>` (from `isPrimaryKey: true` fields)
  - `deriveGTFSFieldTypes(spec)` → `{ [filename]: { [fieldName]: string } }` (field `type` strings)
  - `deriveGTFSRelationships(spec)` → foreign key map (from `foreignKey` on fields)
  - `deriveGTFSTables(spec)` → `{ AGENCY: 'agency.txt', ... }` (filename → UPPER_CASE key, strip `.txt`, replace spaces/hyphens with `_`)
  - `deriveGTFSFiles(spec)` → `string[]`
  - `deriveGTFSEnums(spec)` → `Record<fieldName, GTFSEnumValue[]>` (from all `enumValues` across all files)
  - `deriveGTFSSchemas(spec, fieldTypeMeta)` → `Record<filename, ZodObject>`: builds Zod schemas using `GTFS_FIELD_TYPE_METADATA[field.type].zodValidator`, adds `.describe(field.description)`, applies `.optional()` for non-Required presence, and for `allowEmpty: true` fields wraps as `z.union([z.literal(''), baseValidator]).describe(description).optional()`
  - `deriveGTFSFileInfos(spec, schemas)` → `GTFSAdapterFileInfo[]`: full file metadata + schemas for replacing `GTFS_FILES`
- Manually verify output matches current `gtfs.ts` exports for `agency.txt` and `stops.txt` (spot check): all 19 primary keys match exactly, agency/stops schemas verified

Phase 15 - App integration: swap imports, delete generated files
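Before the swap, it may help to see roughly what the Phase 14 derivations compute. The sketch below covers three of them without any Zod dependency. The function names and key rules follow the Phase 14 checklist; the trimmed-down spec types, the `name` and `files` properties, the sample data, and the placeholder labels are assumptions, so the real adapter may differ in detail.

```typescript
interface GTFSEnumValue {
  value: number | string;
  label: string;
  description: string;
}

// Trimmed-down spec types; `name` and `files` are assumed property names.
interface GTFSFieldSpec {
  name: string;
  type: string;
  isPrimaryKey?: boolean;
  enumValues?: GTFSEnumValue[];
}

interface GTFSFileSpec {
  filename: string;
  fields?: GTFSFieldSpec[];
}

interface GTFSSpec {
  specVersion: string;
  files: GTFSFileSpec[];
}

// filename -> name of the field flagged isPrimaryKey (first match wins).
function deriveGTFSPrimaryKeys(spec: GTFSSpec): Record<string, string> {
  const keys: Record<string, string> = {};
  for (const file of spec.files) {
    const flagged = (file.fields || []).filter((f) => f.isPrimaryKey === true);
    if (flagged.length > 0) {
      keys[file.filename] = flagged[0].name;
    }
  }
  return keys;
}

// UPPER_CASE key -> filename: strip ".txt", replace spaces/hyphens with "_".
function deriveGTFSTables(spec: GTFSSpec): Record<string, string> {
  const tables: Record<string, string> = {};
  for (const file of spec.files) {
    const key = file.filename
      .replace(/\.txt$/, '')
      .replace(/[\s-]+/g, '_')
      .toUpperCase();
    tables[key] = file.filename;
  }
  return tables;
}

// fieldName -> enum values. A field name shared by two files is overwritten
// by the later file in this sketch.
function deriveGTFSEnums(spec: GTFSSpec): Record<string, GTFSEnumValue[]> {
  const enums: Record<string, GTFSEnumValue[]> = {};
  for (const file of spec.files) {
    for (const field of file.fields || []) {
      if (field.enumValues) {
        enums[field.name] = field.enumValues;
      }
    }
  }
  return enums;
}

// Tiny sample spec; labels and descriptions are placeholders.
const sampleSpec: GTFSSpec = {
  specVersion: '(reference date)',
  files: [
    {
      filename: 'agency.txt',
      fields: [
        { name: 'agency_id', type: 'Unique ID', isPrimaryKey: true },
        { name: 'agency_name', type: 'Text' },
      ],
    },
    {
      filename: 'stop_times.txt',
      fields: [
        {
          name: 'timepoint',
          type: 'Enum',
          enumValues: [
            { value: 0, label: '(label 0)', description: '' },
            { value: 1, label: '(label 1)', description: '' },
          ],
        },
      ],
    },
  ],
};
```

With this sample, `deriveGTFSTables(sampleSpec)` yields the keys `AGENCY` and `STOP_TIMES`, and `deriveGTFSPrimaryKeys(sampleSpec)` has an entry only for `agency.txt`, since the sample `stop_times.txt` flags no primary key.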
Replace the generated file exports with adapter-derived values. Import paths are kept stable by making `gtfs.ts` and `gtfs-enums.ts` thin re-export wrappers rather than updating every import site across the app.

Discovered during implementation:
- `gtfs-entities.ts` could not use `z.infer<>` on the dynamically-built schemas (TypeScript cannot infer specific field shapes from runtime-derived `ZodObject`s). All entity types are instead aliased to `Record<string, any>` with an eslint-disable comment. Runtime validation is handled by the derived Zod schemas; TypeScript static typing is intentionally loose here.
- `gtfs.ts` was extended beyond thin re-exports:
  - `GTFSFilePresence` enum kept for backward compat (values are identical to `GTFSPresence` strings)
  - individual `*Schema` exports added as aliases into `GTFSSchemas[filename]` for consumers that destructure them
  - `GTFS_TABLES` retained as an explicit `as const` literal object (rather than using `deriveGTFSTables`) to keep narrow string-literal types
  - utility functions `getFieldDescription`, `getFileSchema`, `getAllFieldDescriptions` added to centralize schema access patterns
- `src/gtfs-spec.json` deleted as planned.

Checklist:
- Update `src/types/gtfs.ts`: remove all generated content, replace with adapter-derived exports (thin wrapper with backward-compat additions)
- Update `src/types/gtfs-enums.ts`: thin re-export of `deriveGTFSEnums(gtfsSpec)` output
- Update `src/types/gtfs-entities.ts`: all entity types aliased to `Record<string, any>` (static inference not possible for dynamic schemas)
- Delete `src/gtfs-spec.json`
- Run `npm run typecheck` and fix any type errors
- Smoke-test the running app: open a feed, verify fields render correctly, confirm tooltips show descriptions, confirm enum dropdowns show correct options
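The choice to keep `GTFS_TABLES` as an explicit `as const` literal comes down to what TypeScript can know statically. The sketch below contrasts a runtime-derived map, whose keys and values widen to `string`, with a literal object that preserves narrow types. The names `derivedTables`, `GTFSTableName`, and `describeTable` are hypothetical and only serve the illustration.

```typescript
// Runtime-derived shape: TypeScript only knows "string -> string".
const derivedTables: Record<string, string> = {};
for (const filename of ['agency.txt', 'stop_times.txt']) {
  derivedTables[filename.replace(/\.txt$/, '').toUpperCase()] = filename;
}

// Explicit literal: keys and values keep their string-literal types.
const GTFS_TABLES = {
  AGENCY: 'agency.txt',
  STOP_TIMES: 'stop_times.txt',
} as const;

// Union of the literal values: 'agency.txt' | 'stop_times.txt'.
type GTFSTableName = (typeof GTFS_TABLES)[keyof typeof GTFS_TABLES];

// With the narrow union the compiler can check this switch is exhaustive;
// with Record<string, string> it could not.
function describeTable(file: GTFSTableName): string {
  switch (file) {
    case 'agency.txt':
      return 'agencies';
    case 'stop_times.txt':
      return 'stop times';
  }
}

// The two maps hold the same data at runtime; only the static types differ.
const sameAtRuntime =
  derivedTables['AGENCY'] === GTFS_TABLES.AGENCY &&
  derivedTables['STOP_TIMES'] === GTFS_TABLES.STOP_TIMES;
```

The trade-off is that the literal has to be kept in sync with the spec by hand, which is presumably acceptable given how rarely the set of GTFS files changes.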
Phase 16 - Documentation
- Create `docs/gtfs-implementation-status.md` with a table: `| File | File Presence | UI Support | Notes |`, one row per GTFS file; `UI Support` values: `Full`, `Partial`, `None`
- Update `README.md` with a link to that file
- Remove `npm run generate-gtfs-types` from `README.md` if present (none found)

Phase 17 - locations.geojson (final)
`locations.geojson` is a GeoJSON file: not tabular, no CSV fields in the GTFS sense. Include a file entry with file-level metadata only.
- Create `src/gtfs-spec/files/locations-geojson.ts` with `filename: 'locations.geojson'`, `format: 'geojson'`, `presence`, `description` verbatim from reference, `fields: undefined`
- Update `src/gtfs-spec/index.ts`

Open questions / notes for future phases
- `continuous_pickup`/`continuous_drop_off` "or empty" enum values appear in both `routes.txt` (Phase 4) and `stop_times.txt` (Phase 5). The value `1` means "No continuous stopping pickup" and an empty string is equivalent; this is a GTFS spec quirk. The enum value should store `value: 1` with the "or empty" equivalence noted in the description verbatim. If the UI ever needs to handle this distinction (e.g. normalizing empty → 1 on import), that's an app-layer concern, not spec-layer.
- `translations.txt`'s `table_name` enum values are GTFS filenames themselves; the spec lists exactly which tables are translatable. Copy that list verbatim.

Original Issue
We should keep track of which parts of the spec we are supporting and which ones still need support. This can be done in an issue or in the repo or in the landing-zone (gtfs.zone landing page).
I think we should move from an auto-generated spec to an AI-generated spec that is carefully reviewed (manually + AI) and then just subscribe to the GTFS RSS feed for updates. It doesn't change that often and they have a changelog, and users of this will not be on the cutting edge of GTFS.
Title changed from "Add Compatibility Progress/Checklist/Status for GTFS Spec" to "Replace GTFS spec scraper with hand-authored TypeScript library (+ implementation status doc)"
maxtkc referenced this issue 2026-04-05 16:24:17 +00:00