# Plan: `skraak import segments` Command
## Overview
Import segments from AviaNZ `.data` files into the skraak database, applying a species/calltype mapping file.
## CLI Usage
```bash
skraak import segments \
--db ./db/skraak.duckdb \
--folder /path/to/folder \
--mapping /path/to/mapping_2026-03-14.json \
--dataset abc123 \
--location loc456 \
--cluster clust789
```
## Database Relationships (Critical for Insert Order)
```
dataset
└── location (FK: dataset_id)
└── cluster (FK: location_id, dataset_id)
└── file (FK: cluster_id, location_id)
└── file_dataset (FK: file_id, dataset_id)
└── file_metadata (FK: file_id)
└── segment (FK: file_id, dataset_id, file_dataset)
└── label (FK: segment_id, species_id, filter_id)
└── label_metadata (FK: label_id)
└── label_subtype (FK: label_id, calltype_id, filter_id)
species
└── call_type (FK: species_id)
```
## Insert Order (Per Segment)
1. **segment** → needs file_id (from DB by hash), dataset_id
2. **label** → needs segment_id (just inserted), species_id (from mapping), filter_id (from DB), certainty
3. **label_metadata** → needs label_id, stores `{"skraak_label_id": "...", "comment": "..."}`
4. **label_subtype** → needs label_id, calltype_id (from mapping), filter_id, certainty
## Files to Create/Modify
### 1. NEW: `utils/mapping.go`
Mapping file utilities:
- `SpeciesMapping` struct
- `LoadMappingFile(path string) (map[string]SpeciesMapping, error)`
- `ValidateMappingAgainstDB(db *sql.DB, mapping map[string]SpeciesMapping) error`
### 2. NEW: `tools/import_segments.go`
Core tool logic with types:
- `ImportSegmentsInput`
- `ImportSegmentsOutput`
- `ImportSummary`
- `SegmentImport`
- `LabelImport`
- `ImportError`
- `ImportSegments()` function
### 3. MODIFY: `cmd/import.go`
Add a `segments` subcommand to the switch statement and a `runImportSegments()` handler.
## Detailed Algorithm
### Phase A: Input Validation
1. Validate folder exists and contains .data files
2. Validate dataset_id, location_id, cluster_id format (12-char nanoid)
3. Validate cluster belongs to location, location belongs to dataset
4. Validate dataset type is 'structured'
5. Load mapping file and validate JSON schema
### Phase B: Pre-Import Validation (Fail-Fast)
1. Parse all .data files, collect:
- Unique filter names
- Unique species names (from labels)
- Unique calltype names per species
2. Validate all filters exist in DB:
```sql
SELECT id FROM filter WHERE name IN (?) AND active = true
```
3. Validate mapping covers all species found in .data files:
- For each species in .data: must exist as key in mapping
- mapping[species].species must exist in species table
4. Validate mapping covers all calltypes:
- For each (species, calltype) in .data:
- If mapping[species].calltypes exists: calltype must map to existing call_type.label
- Else: calltype must exist as-is in call_type table for that species
5. For each .data file:
- Find corresponding WAV file (same folder, strip .data extension)
- Hash the WAV file
- Verify hash exists in file table for this cluster
- Verify no existing labels for this file (fresh imports only)
6. Validate segment bounds:
- start_time < end_time
- end_time <= file.duration
### Phase C: Transactional Import
```sql
BEGIN TRANSACTION;
-- Pre-load all ID lookups into memory
-- filterIDMap: filter_name -> filter_id
-- speciesIDMap: species_label -> species_id
-- calltypeIDMap: (species_label, calltype_label) -> calltype_id
-- fileIDMap: xxh64_hash -> file_id
FOR each .data file:
wavPath = dataPath without .data extension
hash = ComputeXXH64(wavPath)
fileID = fileIDMap[hash]
-- Update file_metadata with skraak_hash
INSERT OR REPLACE INTO file_metadata (file_id, json, active)
VALUES (fileID, JSON_SET(COALESCE(json, '{}'), '$.skraak_hash', ?), true)
FOR each segment in .data:
IF segment has any label with bookmark=true: CONTINUE
segmentID = GenerateLongID()
-- Clamp freq_low/freq_high (default 0, sample_rate/2)
INSERT INTO segment (id, file_id, dataset_id, start_time, end_time,
freq_low, freq_high, created_at, last_modified, active)
VALUES (segmentID, fileID, datasetID, start, end, freqLow, freqHigh, now(), now(), true)
FOR each label in segment:
labelID = GenerateLongID()
dbSpecies = mapping[label.Species].species
speciesID = speciesIDMap[dbSpecies]
filterID = filterIDMap[label.Filter]
certainty = label.Certainty
INSERT INTO label (id, segment_id, species_id, filter_id, certainty,
created_at, last_modified, active)
VALUES (labelID, segmentID, speciesID, filterID, certainty, now(), now(), true)
-- Build label_metadata JSON
metadata := {"skraak_label_id": labelID}
IF label.Comment != "": metadata["comment"] = label.Comment
INSERT INTO label_metadata (label_id, json, created_at, last_modified, active)
VALUES (labelID, metadata, now(), now(), true)
IF label.CallType != "":
subtypeID = GenerateLongID()
dbCalltype = mapping[label.Species].calltypes[label.CallType]
OR label.CallType if no mapping entry
calltypeID = calltypeIDMap[(dbSpecies, dbCalltype)]
INSERT INTO label_subtype (id, label_id, calltype_id, filter_id, certainty,
created_at, last_modified, active)
VALUES (subtypeID, labelID, calltypeID, filterID, certainty, now(), now(), true)
COMMIT;
```
## Mapping File Format
```json
{
"Don't Know": {
"species": "Don't Know"
},
"GSK": {
"species": "Roroa",
"calltypes": {
"Male": "Male - Solo",
"Female": "Female - Solo"
}
}
}
```
## Output JSON
```json
{
"summary": {
"data_files_found": 42,
"data_files_processed": 42,
"total_segments": 342,
"imported_segments": 342,
"imported_labels": 356,
"imported_subtypes": 280,
"processing_time_ms": 1234
},
"segments": [
{
"segment_id": "abc123...",
"file_name": "recording.wav",
"start_time": 165.0,
"end_time": 185.0,
"labels": [
{
"label_id": "def456...",
"species": "Roroa",
"calltype": "Male - Solo",
"filter": "opensoundscape-kiwi-1.2",
"certainty": 70
}
]
}
],
"errors": []
}
```
## Invariants
| Invariant | How Enforced |
|-----------|--------------|
| Operates on 1 folder | CLI flag validation |
| All certainty scores transferred | Direct copy from .data to DB |
| All filters exist in DB | Phase B validation |
| Mapping is complete | Phase B validation |
| File hashes exist in DB | Phase B - lookup by xxh64_hash |
| No existing labels | Phase B - query before import |
| Species/calltypes exist | Phase B validation |
| Fail-fast on error | All validation before transaction |
| Single transaction | `db.BeginLoggedTx()` with rollback |
## Key Code Patterns to Reuse
- `utils.ParseDataFile()` - Parse .data files
- `utils.FindDataFiles()` - Find .data files in folder
- `utils.ComputeXXH64()` - Hash WAV files
- `utils.GenerateLongID()` - 21-char nanoid for segment/label IDs
- `utils.ValidateShortID()` - Validate 12-char IDs
- `db.BeginLoggedTx()` - Transaction with mutation logging
- `utils.CheckDuplicateHash()` - Check if hash exists (adapt for this use)
## Decisions Made
1. ✅ WAV files same folder as .data files
2. ✅ freq_low/freq_high stored in segment table (clamp to [0, sample_rate/2] if needed)
3. ✅ Multiple labels per segment → multiple label records
4. ✅ ProgressHandler pattern for progress reporting
5. ✅ Fail-fast + single transaction