Saddleback_Sth_Is saddle3 x
Silvereye silver3 x
South Island Oystercatcher soioys1 x
South Island Wren soiwre1 -
Thrush_Song sonthr1 x
Tomtit tomtit1 x
Tui tui1 x
Warbler_Grey gryger1 x
Weka_spp weka1 x
Yellowhammer yellow2 x
function flatten(signal, freq)
    if ndims(signal) > 1
        signal = vec(mean(signal, dims=2))
    end
    # signal = DSP.resample(signal, 16000.0f0 / freq)
    signal = vec(signal) # ensure Vector output
    # freq = 16000
    return signal, freq
end

for item in a
    FLAC.load(FileIO.query(item)) |>
        x -> flatten(x[1], x[2]) |>
        x -> get_image_from_sample(x[1], x[2]) |>
        img -> save(replace(item, "flac" => "png"), img)
end
---
name: propagate-call-classification
description: Auto-classify segments in one filter based on verified classifications from another filter. Use when one ML filter has been reviewed and you want to propagate those classifications to another filter on overlapping segments.
---

# Propagate Call Classification

Propagate verified classifications from one filter to another for overlapping segments. This is useful when you have reviewed classifications from one ML model and want to apply those learnings to another model's segments.

## When to Use

- You have reviewed/classified segments for one filter (e.g., `opensoundscape-kiwi-1.2`)
- You want to auto-classify overlapping segments in another filter (e.g., `opensoundscape-kiwi-1.5`)
- Both filters have segments on the same audio files

## Prerequisites

- Folder with `.data` files containing segments from multiple filters
- One filter already reviewed (certainty >= 80 for confirmed calls, certainty = 0 for confirmed noise)
- `./skraak` binary built and available

## Workflow

### Step 1: Identify candidate segments

Run summarise and use jq to find overlapping segments where:

- Source filter has verified classification (certainty >= 90 or = 0)
- Target filter has unverified classification (certainty = 70)
- Segments overlap in time
- Species/calltype differ between filters

```bash
./skraak calls summarise --folder "<folder_path>" > /tmp/summary.json
cat /tmp/summary.json | jq '
  .segments |
  group_by(.file) |
  map({file: .[0].file,
       segments: [.[] | {start: .start_time, end: .end_time, labels: .labels}]}) |
  map({
    file: .file,
    auto_classify: [
      .segments[] as $s1 |
      .segments[] as $s2 |
      select(
        ($s1.labels[0].filter == "<source_filter>") and
        ($s2.labels[0].filter == "<target_filter>") and
        # Time overlap
        ($s1.start | floor) <= ($s2.end | ceil) and
        ($s2.start | floor) <= ($s1.end | ceil) and
        # Source is verified
        (($s1.labels[0].certainty >= 90) or ($s1.labels[0].certainty == 0)) and
        # Target is unverified
        ($s2.labels[0].certainty == 70) and
        # They differ
        (($s2.labels[0].species != $s1.labels[0].species) or
         ($s2.labels[0].calltype != $s1.labels[0].calltype))
      ) |
      {
        file: .file,
        source_start: $s1.start,
        source_end: $s1.end,
        source_species: $s1.labels[0].species,
        source_calltype: $s1.labels[0].calltype,
        source_certainty: $s1.labels[0].certainty,
        target_start: $s2.start,
        target_end: $s2.end,
        target_species: $s2.labels[0].species,
        target_calltype: $s2.labels[0].calltype,
        action: (if $s1.labels[0].certainty == 0 then "Noise"
                 else $s1.labels[0].species + "+" + ($s1.labels[0].calltype // "N/A") end)
      }
    ]
  }) |
  map(select(.auto_classify | length > 0)) |
  [.[] | .auto_classify[]]
'
```

### Step 2: Present summary to user

Group results by action type:

| Action | Count | Description |
|--------|-------|-------------|
| GSK+Male from "Don't Know" | N | Target had "Don't Know", source verified GSK Male |
| GSK+Female from "Don't Know" | N | Target had "Don't Know", source verified GSK Female |
| GSK+Duet from "Don't Know" | N | Target had "Don't Know", source verified GSK Duet |
| Calltype correction | N | Both had GSK but different calltypes |
| Noise from "Don't Know" | N | Source confirmed as noise (certainty=0) |
| Weka from "Don't Know" | N | Source identified as Weka |
| Other species | N | Other verified species |

Ask user to confirm before applying changes.

### Step 3: Apply approved changes

Use `skraak calls modify` to update target filter segments:

```bash
# For species+calltype changes
./skraak calls modify --file "<folder>/<file>.wav.data" --reviewer Claude \
  --filter <target_filter> --segment <start>-<end> \
  --species <Species>+<Calltype> --certainty 90

# For noise
./skraak calls modify --file "<folder>/<file>.wav.data" --reviewer Claude \
  --filter <target_filter> --segment <start>-<end> \
  --species Noise --certainty 90

# For species without calltype (Weka, etc.)
./skraak calls modify --file "<folder>/<file>.wav.data" --reviewer Claude \
  --filter <target_filter> --segment <start>-<end> \
  --species <Species> --certainty 90
```

### Step 4: Verify results

```bash
./skraak calls summarise --folder "<folder_path>" --brief
```

Check that both filters now have similar species counts and the "Don't Know" count decreased in the target filter.

## Classification Rules

- **Reviewer**: always "Claude" (indicates LLM-assisted classification)
- **Always use certainty=90** for auto-classified segments (distinguishes from ML=70, human=100)
- **Only modify** segments where target has certainty=70 (unverified)
- **Never modify** segments where target already has certainty >= 80
- **Handle noise carefully**: source certainty=0 means confirmed noise, propagate as Noise species

## Segment Matching

Segments overlap when:

```
floor(source.start) <= ceil(target.end) AND floor(target.start) <= ceil(source.end)
```

The `--segment` parameter uses integer seconds, so floor/ceil handles the matching:

- Source: 362.5-390 matches target: 360-390 via `--segment 362-390`
- Source: 297.5-322.5 matches target: 295-323 via `--segment 297-323`

## Example Output

```
Auto-Classification Candidates for opensoundscape-kiwi-1.5

Based on verified opensoundscape-kiwi-1.2 classifications:

### GSK from "Don't Know" (14 segments)
| File | Segment | Action |
|------|---------|--------|
| ridgeend_20260303_204507 | 297-323 | → GSK+Female |
| ridgeend_20260303_211507 | 292-320 | → GSK+Female |
...

### Calltype Corrections (4 segments)
| File | Segment | Current | Correct |
|------|---------|---------|---------|
| ridgeend_20260304_043007 | 360-388 | GSK+Duet | → GSK+Male |
...

Summary: 27 segments ready for auto-classification

Apply these changes? (y/n)
```

## Notes

- This is a one-way propagation: source filter → target filter
- Run `skraak-check-call-classification` skill first to verify the source filter
- The target filter may have unique segments not in source - these remain unverified
- Species that sound similar (e.g., GSK and LSK) may need human review - flag these for user attention
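The floor/ceil overlap rule can be sketched in Python (a hypothetical helper for illustration, not part of skraak):

```python
import math

def segments_overlap(src_start, src_end, tgt_start, tgt_end):
    """Integer-second overlap test, matching the jq floor/ceil conditions."""
    return (math.floor(src_start) <= math.ceil(tgt_end)
            and math.floor(tgt_start) <= math.ceil(src_end))

def segment_arg(start, end):
    """Render a --segment value from fractional-second boundaries."""
    return f"{math.floor(start)}-{math.ceil(end)}"

# Worked examples from the matching rules above
print(segments_overlap(362.5, 390, 360, 390))  # True
print(segment_arg(362.5, 390))                 # 362-390
print(segment_arg(297.5, 322.5))               # 297-323
```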
---
name: check-call-classification
description: Check ML-classified bird call spectrograms by visual inspection, confirming or correcting calltype labels in .data files
---

# Check Call Classification

Review ML-classified bird call segments by visually inspecting spectrogram images. Confirm correct classifications or reclassify incorrect ones. This skill can only **upgrade** certainty (e.g. 70 -> 90), never downgrade it.

## When to Use

When the user wants to review/check ML calltype classifications on bird call segments. Typical use cases:

- Check all GSK (Great Spotted Kiwi / Roroa) calltypes in a folder
- Review a specific calltype (e.g. just GSK+Male)
- Check "Don't Know" segments for any missed kiwi calls
- Review segments from a specific ML filter

## Prerequisites

- Folder with `.data` files
- Corresponding `.wav` files present (needed for clip generation)
- `./skraak` binary built and available

## Workflow

### Step 1: Scope the review

Ask the user for:

- **Folder path** containing `.data` files
- **Which filter** to check (e.g. `opensoundscape-kiwi-1.2`)
- **Which species+calltypes** to review (e.g. `GSK`, `GSK+Male`, `Don't Know`)

Run summarise to show what's available:

```bash
./skraak calls summarise --folder "<folder_path>" --brief > /tmp/summarise.json 2>&1
```

Show the user the filters, species, calltypes, and review status. Confirm:

- The exact species labels to use for modifications (e.g. `GSK+Male`, `GSK+Female`, `GSK+Duet`, `Noise`, `Don't Know`)
- Whether to skip certainty 100 segments (default: yes, these are human-reviewed)

### Step 2: Load reference spectrograms

Look for reference examples in this skill's folder. Map species to reference folder:

- GSK / Great Spotted Kiwi -> `Roroa/`
- (Future species get their own folders)

The folder structure is `<common_name>/<calltype>/` containing paired `.png` and `.wav` example files.

Read 3 representative PNG images from each calltype subfolder (pick files evenly spaced through the sorted file list). Study the visual patterns:

**GSK/Roroa calltypes:**

- **Male Solo**: bold, regularly spaced vertical bars at lower frequency, strong signal
- **Female Solo**: thinner, higher-frequency bars, often fainter/more delicate
- **Duet**: overlapping male + female patterns visible simultaneously, two distinct frequency bands

**Noise/False positive**: no clear repeated bar pattern, random speckle or broadband noise

**Note on Frequency Scaling:** Spectrogram frequency ranges may vary based on recording equipment (8kHz/16kHz/32kHz). Focus on the **relative pattern** (spacing, thickness, regularity of vertical bars) rather than absolute frequency position. Male calls show bold, regularly-spaced bars at lower frequencies relative to the display; female calls are thinner and higher relatively.

If the sample rate is above 16kHz, audio is downsampled to 16kHz; if the sample rate is below 16kHz, it is left as is for the spectrogram. In practice images will be 8kHz (most common) or 16kHz.

### Step 3: Generate clips for the folder

Generate all clips at once using the folder mode:

```bash
./skraak calls clip --folder "<folder_path>" --prefix <prefix> --output /tmp/<prefix>/ \
  --filter <filter> --species <species> --size 224 --color
```

Use `--species GSK` (no calltype) to get all calltypes, or `--species GSK+Male` for a specific calltype.

**Before reading any images**, report the number of spectrograms generated and ask the user to confirm they are OK with the token cost. Each 224px PNG is one image token. Give the user the option to proceed, reduce scope (e.g. filter to a specific calltype), or abort.

### Token Efficiency Guidelines

- **<100 segments**: Review all at once
- **100-300 segments**: Review in one batch (acceptable cost)
- **>300 segments**: Consider filtering by calltype (e.g., review Duets/Females first, then Males) or by date range to manage token costs

### Step 4: Examine spectrograms

Read each generated `.png` file using the Read tool. Compare against reference images from Step 2.

For each segment, determine:

- **Confirm**: classification is correct -> upgrade certainty to 90
- **Reclassify**: wrong calltype -> modify to correct species+calltype at certainty 90
- **Skip**: too faint or uncertain -> leave as-is at current certainty (never downgrade)

### When to Skip (Leave at certainty 70)

- **Duration <15 samples** (very short segments)
- **Signal-to-noise ratio too low** (fainter than reference examples)
- **Fragmented/incomplete patterns** (file boundary artifacts)
- **Uncertain classification** (pattern doesn't clearly match reference)
- **Never guess** - skip rather than misclassify

Parse segment start-end times from the clip filename: `<prefix>_<basename>_<START>_<END>.png`

Map back to the source .data file using the basename (e.g. `check_tx45_20260304_203005_310_345.png` -> `tx45_20260304_203005.wav.data`, segment `310-345`).

### Step 5: Present batch summary

Present results in a table:

```
Folder: tx45/202603 march (filter: opensoundscape-kiwi-1.2, species: GSK)

#  | File                 | Segment | Current Label  | Action         | Notes
---|----------------------|---------|----------------|----------------|------
1  | tx45_20260304_204504 | 132-163 | GSK+Male (70)  | Confirm (90)   | Clear male pattern
2  | tx45_20260305_040003 | 352-395 | GSK+Male (70)  | -> GSK+Duet(90)| Female visible too
3  | tx45_20260304_211504 | 442-463 | GSK+Male (70)  | Skip           | Too faint to confirm
```

**Ask user to approve** before executing any modifications. User can override individual decisions.

### Step 6: Execute approved changes

Confirm correct classification (bump certainty only):

```bash
./skraak calls modify --file "<folder>/<basename>.wav.data" --reviewer <Your Name> \
  --filter <filter> --segment <start>-<end> --certainty 90
```

Reclassify to different calltype:

```bash
./skraak calls modify --file "<folder>/<basename>.wav.data" --reviewer <Your Name> \
  --filter <filter> --segment <start>-<end> --species GSK+Duet --certainty 90
```

Mark as noise (only if clearly not a bird call):

```bash
./skraak calls modify --file "<folder>/<basename>.wav.data" --reviewer <Your Name> \
  --filter <filter> --segment <start>-<end> --species Noise --certainty 90
```

Bookmark an outstanding example:

```bash
./skraak calls modify --file "<folder>/<basename>.wav.data" --reviewer <Your Name> \
  --filter <filter> --segment <start>-<end> --certainty 90 --bookmark
```

Add a comment to preserve information:

```bash
./skraak calls modify --file "<folder>/<basename>.wav.data" --reviewer <Your Name> \
  --filter <filter> --segment <start>-<end> --certainty 90 --comment "Clear duet with visible female component"
```

**Bookmark**: Use `--bookmark` to mark outstanding, clear, or beautiful examples of a call. Bookmarked segments can be easily found later for training reference or demonstration purposes.

**Comment**: Use `--comment "text"` to preserve useful information about a segment. Comments are stored in the .data file and visible when classifying calls manually. Max 140 characters, ASCII only. Examples:

- "Clear duet with visible female component"
- "Unusual timing pattern, verify with audio"
- "Good training example for male vs female distinction"

### Step 7: Verify Changes (Optional)

Check that modifications were applied correctly:

```bash
# Count segments at certainty 90
grep -r '"certainty": 90' <folder>/*.data | wc -l

# View a specific modified file
cat <folder>/<filename>.wav.data | grep -A5 '"segment_start":<start>'
```

### Step 8: Final summary

```
Review Complete

Folder: <path>
Filter: <filter>, Species: <species>
Segments reviewed: N
Confirmed (-> 90): A (%)
Reclassified: B (%)
Skipped (left as-is): C (%)
```

## Classification Rules

- **Reviewer**: always the actual model being used, eg haiku-4.5, kimi-k2.5, glm-5, minimax-m2.7
- **Never modify** segments with certainty >= 90 (already reviewed by LLM or human)
- **Only upgrade certainty**: confirm (70 -> 90) or reclassify (-> 90). Never downgrade. If uncertain, skip the segment and leave it as-is.
- **Certainty 90**: use for all LLM classifications (distinguishes from ML=70, human=100)
- **Always ask** the user for labels if you see something identifiable that isn't in the expected species list
- Use `calls summarise --brief` to check existing labels in the folder

## Notes

- Image size defaults to 224px with `--color`. Use `--size 448` if more detail needed (higher token cost)
- Always report clip count and ask user before reading images (token cost awareness)
- Distant/faint calls are common - skip rather than guess
- The ML models can be ~97% accurate, so most classifications will often be correct. The goal is to increase certainty on correct ones, not to second-guess the model.
- The `.data.bak` files serve as backups - never modify them

## Calltype-Specific Tips

### GSK/Roroa Specific Guidance

- **Male**: Most common, look for bold vertical stripes at ~1/3 to 1/2 of spectrogram height
- **Female**: Higher frequency, thinner bars, often fainter - don't mistake faintness for absence
- **Duet**: Both patterns present simultaneously - if you see ANY female component with male, it's a Duet
- **Distant calls**: Fainter versions of same patterns are still valid - only skip if pattern is unclear

### Reference Image Strategy

1. Read 2-3 reference images from EACH calltype folder
2. Mentally note the **key distinguishing features**:
   - Male: Bar thickness, spacing regularity, relative frequency position
   - Female: Thin bars, higher position, delicate appearance
   - Duet: Two distinct frequency bands with overlapping patterns
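The filename round-trip described in Step 4 can be sketched as follows (a hypothetical helper for illustration; skraak does not ship this function):

```python
import re

def parse_clip_filename(name):
    """Split '<prefix>_<basename>_<START>_<END>.png' into its parts.

    The basename itself contains underscores, so anchor on the two
    trailing integer fields rather than splitting on every underscore.
    """
    m = re.match(r"^([^_]+)_(.+)_(\d+)_(\d+)\.png$", name)
    if not m:
        raise ValueError(f"unexpected clip filename: {name}")
    prefix, basename, start, end = m.groups()
    return {
        "data_file": f"{basename}.wav.data",
        "segment": f"{start}-{end}",
    }

# Example from Step 4
print(parse_clip_filename("check_tx45_20260304_203005_310_345.png"))
# {'data_file': 'tx45_20260304_203005.wav.data', 'segment': '310-345'}
```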
---
name: import-segments
description: Import classified segments from a single folder of .data files into the database - looks up IDs, creates mapping, runs import
---

# Import Segments from a Single Folder

Import manually classified AviaNZ `.data` file segments into the database for one folder (one cluster).

## When to Use

When the user wants to import classified `.data` files from a folder. The folder must already have `.data` files (from `calls classify` or `calls from-preds`) and the WAV files must already be imported into the database.

## Prerequisites

Before this skill runs, the user should have already:

1. Generated predictions and created `.data` files (`calls from-preds` or similar)
2. Classified calls (`calls classify`)
3. Imported the WAV files into the database (`import folder` or `import bulk`)

## Workflow

### Step 1: Identify the folder

Get the folder path from the user. Confirm `.data` files exist:

```bash
ls <folder>/*.data | head -5
```

### Step 2: Determine dataset, location, and cluster IDs

Extract the cluster/location name from the folder path (e.g., `tx51` from `.../tx51/`). Query the database to find matching IDs:

```bash
./skraak sql --db ./db/skraak.duckdb "SELECT c.id AS cluster_id, c.name AS cluster_name, l.id AS location_id, l.name AS location_name, d.id AS dataset_id, d.name AS dataset_name FROM cluster c JOIN location l ON c.location_id = l.id JOIN dataset d ON l.dataset_id = d.id WHERE c.active = true AND c.name ILIKE '%tx51%'"
```

Adjust the `ILIKE` pattern based on the folder name. If multiple results, ask the user to pick. Show the user the matched dataset/location/cluster and confirm before proceeding.

### Step 3: Check mapping file

Look for an existing `mapping_*.json` in the folder:

```bash
ls <folder>/mapping_*.json
```

If no mapping exists, tell the user to run the `skraak-data-mapping` skill first:

```
No mapping file found. Please run: /skraak-data-mapping
```

Then stop and wait for the user.

### Step 4: Run the import

```bash
./skraak import segments \
  --db ./db/skraak.duckdb \
  --dataset <dataset_id> \
  --location <location_id> \
  --cluster <cluster_id> \
  --folder "<folder_path>" \
  --mapping "<folder_path>/mapping_YYYY-MM-DD.json"
```

Use the production database (`./db/skraak.duckdb`) unless the user specifies otherwise.

### Step 5: Report results

Show the user the import summary from the JSON output:

- Data files processed
- Segments imported
- Labels imported
- Any errors

## Error Handling

- If no `.data` files found: abort with clear message
- If cluster/location not found in DB: ask user for the correct name or IDs
- If files not yet imported (hash mismatch): tell user to run `import folder` first
- If segments already have labels: warn user - `import segments` only works on files with no existing labels
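Step 2's name extraction can be sketched as follows (a hypothetical helper for illustration; skraak itself does not ship this):

```python
from pathlib import Path

def cluster_pattern(folder_path):
    """Derive an ILIKE pattern from the last component of the folder path."""
    name = Path(folder_path.rstrip("/")).name
    return f"%{name}%"

print(cluster_pattern("/data/surveys/tx51/"))  # %tx51%
```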
---
name: dataset-report
description: Generate a report showing dataset breakdown - locations, file counts per location and per dataset - handling both structured and unstructured dataset types
---

# Dataset Report

Generate a summary report of all active datasets showing location counts, file totals, and files per location.

## When to Use

When the user asks for a dataset overview, summary, breakdown, or report.

## Query

Structured datasets link files through `location > cluster > file`. Unstructured datasets link files through the `file_dataset` junction table with no location hierarchy.

```sql
WITH structured AS (
    SELECT
        d.name AS dataset,
        d.type,
        COUNT(DISTINCT l.id) AS locations,
        COUNT(f.id) AS total_files
    FROM dataset d
    LEFT JOIN location l ON d.id = l.dataset_id AND l.active = true
    LEFT JOIN cluster c ON l.id = c.location_id AND c.active = true
    LEFT JOIN file f ON c.id = f.cluster_id AND f.active = true
    WHERE d.active = true AND d.type != 'unstructured'
    GROUP BY d.name, d.type
),
unstructured AS (
    SELECT
        d.name AS dataset,
        d.type,
        0 AS locations,
        COUNT(fd.file_id) AS total_files
    FROM dataset d
    LEFT JOIN file_dataset fd ON d.id = fd.dataset_id
    LEFT JOIN file f ON fd.file_id = f.id AND f.active = true
    WHERE d.active = true AND d.type = 'unstructured'
    GROUP BY d.name, d.type
),
combined AS (
    SELECT * FROM structured
    UNION ALL
    SELECT * FROM unstructured
)
SELECT
    dataset,
    type,
    locations,
    total_files,
    CASE WHEN locations > 0
         THEN ROUND(total_files::DECIMAL / locations, 0)
         ELSE NULL END AS files_per_location
FROM combined
ORDER BY total_files DESC
```

## Output

Present results as a markdown table with a summary line showing totals.

- Use `-` for files_per_location on unstructured datasets (no location hierarchy)
- Format large numbers with commas for readability
- Include dataset type column
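The output rules above can be sketched as a small formatter (a hypothetical helper assuming each query result row arrives as a dict; not part of skraak):

```python
def format_row(row):
    """Render one result row as a markdown table line per the output rules."""
    fpl = row["files_per_location"]
    return "| {} | {} | {} | {:,} | {} |".format(
        row["dataset"],
        row["type"],
        row["locations"],
        row["total_files"],
        "-" if fpl is None else f"{int(fpl):,}",  # '-' for unstructured datasets
    )

print(format_row({"dataset": "Kahurangi", "type": "structured",
                  "locations": 22, "total_files": 8003,
                  "files_per_location": 364}))
# | Kahurangi | structured | 22 | 8,003 | 364 |
```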
---
name: data-mapping
description: Build a mapping.json file that translates species and calltype names from .data files to DB labels, using interactive prompts
---

# Data File Species/Calltype Mapping

Build a `mapping.json` that translates `.data` file species and calltype names to database `species.label` and `call_type.label` values. Map-only: never create new species or calltypes in the DB.

## When to Use

When the user needs to map species/calltypes from `.data` files before importing selections. Typically run before `skraak import selections`.

## Workflow

### Step 1: Summarise .data files

```bash
./skraak calls summarise --folder <folder> --brief
```

Parse the JSON output. Extract unique species and per-species calltypes from the `filters` object. Each filter has:

- `species`: map of species name to count
- `calltypes`: map of species name to (calltype name to count)

Collect all unique species across all filters, and for each species collect all unique calltypes.

### Step 2: Query DB for available species and calltypes

```bash
./skraak sql --db ./db/skraak.duckdb "SELECT id, label FROM species WHERE active = true"
./skraak sql --db ./db/skraak.duckdb "SELECT ct.label, s.label as species FROM call_type ct JOIN species s ON ct.species_id = s.id WHERE ct.active = true ORDER BY s.label, ct.label"
```

Parse the JSON results. Build a lookup: DB species labels, and per-species DB calltype labels.

### Step 3: Interactive mapping

For each unique species found in `.data` files:

1. Show the user the .data species name and its count
2. Use `AskUserQuestion` with DB species labels as options (pick the most likely matches, up to 4 options + "Skip - no match")
3. If the user picks a DB species and that species has calltypes in the .data files:
   - For each .data calltype, use `AskUserQuestion` with that DB species' calltypes as options (up to 4 + "Keep as-is")
   - "Keep as-is" means the .data calltype name equals the DB calltype name (no mapping needed, omit from calltypes map)

### Step 4: Write mapping.json

Save to `<folder>/mapping_YYYY-MM-DD.json` (using the current date, e.g. `mapping_2026-03-14.json`):

```json
{
  "Don't Know": {"species": "Don't Know"},
  "GSK": {
    "species": "Haast Tokoeka",
    "calltypes": {
      "Male": "Male - Solo",
      "Female": "Female - Solo"
    }
  },
  "Morepork": {"species": "Morepork"}
}
```

Structure rules:

- Top-level keys = .data file species names
- `species` value = DB species label
- `calltypes` map = only present if calltypes exist AND at least one needs remapping
- Each calltype entry: .data calltype name -> DB calltype label
- Omit calltypes that map to themselves (user chose "Keep as-is")
- If a species maps to itself with no calltype remapping, still include it with just `{"species": "SpeciesName"}`
- If user chose "Skip", warn and omit that species entirely

### Step 5: Display summary

Print the final `mapping.json` content to the user as formatted JSON.

## Error Handling

- If a .data species has no reasonable DB match, warn and skip (don't write to mapping)
- If a DB species has no calltypes defined but .data files have calltypes for it, warn the user (the calltypes will be ignored on import)
- If `calls summarise` finds no .data files, abort with clear message
- Never create new species or calltypes in the DB

## Example Session

```
Found 4 species in .data files: GSK (342), Don't Know (28), Morepork (15), Weka (3)

Mapping species: "GSK" (342 segments)
> Which DB species? [Haast Tokoeka | Great Spotted Kiwi | Stewart Island Tokoeka | Other | Skip]
User picks: Haast Tokoeka

GSK has calltypes: Male (200), Female (120), Duet (22)
> Map calltype "Male"? [Male - Solo | Male - Duet | Keep as-is | Other]
User picks: Male - Solo
...

Mapping species: "Don't Know" (28 segments)
> Which DB species? [Don't Know | Skip]
User picks: Don't Know

Writing mapping.json...
```
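The structure rules imply a lookup like the following when the mapping is later applied (a hypothetical sketch assuming the mapping.json layout shown above; the actual import logic lives in skraak):

```python
def map_label(mapping, species, calltype=None):
    """Translate a .data (species, calltype) pair to DB labels.

    Species absent from the mapping were skipped by the user; calltypes
    absent from a species' calltypes map are kept as-is.
    """
    entry = mapping.get(species)
    if entry is None:
        return None  # user chose "Skip" - do not import
    db_species = entry["species"]
    db_calltype = entry.get("calltypes", {}).get(calltype, calltype)
    return db_species, db_calltype

mapping = {
    "GSK": {"species": "Haast Tokoeka",
            "calltypes": {"Male": "Male - Solo"}},
    "Morepork": {"species": "Morepork"},
}
print(map_label(mapping, "GSK", "Male"))  # ('Haast Tokoeka', 'Male - Solo')
print(map_label(mapping, "GSK", "Duet"))  # ('Haast Tokoeka', 'Duet') - kept as-is
print(map_label(mapping, "Weka"))         # None - skipped
```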
---name: bulk-import-setupdescription: Set up bulk import for acoustic monitoring datasets - fuzzy match existing location NAMES, create new ones with GPS parsing if log.txt present in folder, and generate import CSV with comprehensive logging---# Bulk Import Setup for Acoustic Monitoring DatasetsThis skill helps you prepare acoustic monitoring datasets for bulk import by analyzing locations, matching existing records, creating new locations with optional GPS parsing, and generating import CSV files.## When to Use This SkillUse this skill when the user needs to:- Import a batch of acoustic recordings organized in location folders- Create multiple location records for a dataset- Match folder names to existing database locations (fuzzy matching)- Parse GPS coordinates from log files- Generate CSV files for bulk file import## Workflow Overview1. **Analyze** directory structure and existing locations2. **Fuzzy match** folder names to database locations (NAME ONLY - never GPS!)3. **Parse GPS log.txt files** from folders4. **Prepare CSV** with date-range cluster names and file counts5. **Check with user** before creating any new locations6. **Create locations** (if confirmed) with GPS or fallback coordinates7. **Update existing locations** with GPS data (if they have assumed coords)8. **Update CSV** with final location IDs9. **Verify** and provide bulk import command**NEVER match on GPS coordinates!** Recorders can be physically close but represent different locations.```python# ❌ WRONG - Don't compare GPS distancesif abs(loc1_lat - loc2_lat) < 0.001: # NO!return "match"# ✅ CORRECT - Only compare normalized namesnormalized_name1 = normalize_location_name(folder_name)normalized_name2 = normalize_location_name(existing_name)score = SequenceMatcher(None, normalized_name1, normalized_name2).ratio()```### 2. 
GPS Log Files are INSIDE Folders**Parse log files by folder structure**, not by separate log file list.```python# ✅ CORRECT - Log file is inside the folderfolder_path = "/path/to/mok_bl76/"log_path = folder_path / "*log.txt" # Inside the folder!# Extract folder name from path for mappingfolder_name = Path(folder_path).name # "mok_bl76"gps_data[folder_name] = parse_gps_log(log_path)```### 3. Update Matched Locations with GPS DataIf fuzzy matching finds an existing location with "assumed" coordinates, **update it** with GPS data if GPS data is available.```pythonif matched_location:# Check if location has assumed Takaka coordinatesif matched_location['description'].contains("assumed") or \matched_location['description'].contains("Takaka") or \matched_location['description'].contains("pending GPS"):# We have GPS data - UPDATE the location!if gps_data:update_location(matched_location['id'], gps_data['lat'], gps_data['lon'])```### 4. Cluster Names = Date Ranges (NOT Location Names!)**Cluster names represent time periods**, not locations.```python# ❌ WRONG - Using location namecluster_name = "mok_bl76"# ✅ CORRECT - Using date range from filesfirst_file = "mok_bl76_20250422_173008.wav"last_file = "mok_bl76_20250503_064508.wav"cluster_name = "2025-04-22 to 2025-05-03"```### 5. 
CSV Must Have 6 Columns (Including file_count)The bulk import expects **exactly 6 columns**:```csvlocation_name,location_id,directory_path,date_range,sample_rate,file_countmok_bl11,HA4TDv3DlhjX,/media/david/Misc-2/Manu o Kahurangi kiwi survey (3)/Data/K3_mokbl_boulder lake track/mok_bl11,2024-12,8000,68```**Missing file_count will cause**: `Error: CSV row 2 has insufficient columns (expected 6, got 5)`## Step 1: Analyze Directory StructureFirst, understand the organization:```python# Key questions to answer:- How many location folders?- Naming patterns (dash/underscore)?- GPS log files present?- File counts per location?- Date formats in filenames?```**Check for:**- Log files: `LOCATION-log.txt` or `log.txt`- GPS data format: `GPS (lat,long): -40.123456,172.123456`- Filename date patterns: `YYYYMMDD` or `DDMMYY` or `YYMMDD`## Step 2: Fuzzy Match Existing Locations**IMPORTANT**: Always query existing locations first to avoid duplicates!### Fuzzy Matching Rules**CRITICAL: Use strict threshold (0.90) to avoid false matches!**```pythonfrom difflib import SequenceMatcherdef normalize_location_name(name: str) -> str:"""Normalize location name for fuzzy matching."""# Remove common suffixes/prefixesname = name.lower()name = re.sub(r'[_\-\s]+', '', name) # Remove underscores, hyphens, spacesname = re.sub(r'(low|high)$', '', name) # Remove Low/High suffixname = re.sub(r'^mok', 'mok', name) # Normalize mok prefixreturn namedef fuzzy_match_score(s1: str, s2: str) -> float:"""Calculate fuzzy match score between two strings."""return SequenceMatcher(None, normalize_location_name(s1), normalize_location_name(s2)).ratio()def fuzzy_match_location(folder_name: str, existing_locations: List[Dict]) -> Optional[Dict]:"""Fuzzy match folder name to existing location BY NAME ONLY.IMPORTANT:- Returns match ONLY if similarity > 0.90- NEVER compares GPS coordinates (locations can be close but different!)This threshold prevents false matches like:- "mok_bl6P" matching to "mok_bl16" (score: 
0.86) ✗ Different locations!- "mok_tuna51" matching to "mok_tuna31" (score: 0.89) ✗ Different locations!But allows:- "mok_bl6P_Low" matching to "mok_bl6P" (score: 1.00) ✓ Same location"""normalized_folder = normalize_location_name(folder_name)best_match = Nonebest_score = 0.0for loc in existing_locations:normalized_loc = normalize_location_name(loc['name'])score = SequenceMatcher(None, normalized_folder, normalized_loc).ratio()if score > best_score:best_score = scorebest_match = loc# Strict threshold - only match if very similarif best_score > 0.90:print(f" Matched '{folder_name}' -> '{best_match['name']}' (score: {best_score:.3f})")# Check if matched location needs GPS updateif 'description' in best_match:desc = best_match['description'].lower()if 'assumed' in desc or 'takaka' in desc or 'pending gps' in desc:print(f" ⚠️ Location has assumed coords - check for GPS data to update")return best_matchreturn None```**Why 0.90 threshold?**- Too low (0.80-0.85): False matches → wrong location assignments- 0.90: Only matches very similar names (case/punctuation differences)- Example: `mok_bl6P_Low` (score: 1.00) ✓ vs `mok_bl16` (score: 0.86) ✗### Query Existing Locations```pythoncmd = [SKRAAK_BIN, "sql","--db", db_path,f"SELECT name, id FROM location WHERE dataset_id = '{dataset_id}' AND active = true","--limit", "500"]```**Note**: If production DB is locked, ask to use use backup database for querying, but note that locations will be created in production later.**Note**: Don't forget to check some file sample rates using skraak metadata command, or sox.## Step 3: Prepare CSV with Analysis**Create preparation script** that:1. Matches folders to existing locations2. Identifies which locations need to be created3. Analyzes file metadata (count, sample rate, date range)4. Generates CSV with placeholder IDs for new locations5. 
### CSV Format

**CRITICAL: Must have exactly 6 columns, with a date range (never a location name) in the date_range column!**

```csv
location_name,location_id,directory_path,date_range,sample_rate,file_count
mok_bl11,HA4TDv3DlhjX,/media/david/Misc-2/Manu o Kahurangi kiwi survey (3)/Data/K3_mokbl_boulder lake track/mok_bl11,2024-12,8000,68
```

**Key points:**

- `location_name`: the folder/location name
- `location_id`: actual ID or a `<CREATE:name>` placeholder
- `directory_path`: full absolute path
- `date_range`: a **date range** (e.g., "2025-04-22 to 2025-05-03") - NEVER a location name!
- `sample_rate`: integer (e.g., 8000, 32000)
- `file_count`: integer count of WAV files in the folder

(The dataset ID, a 12-character nanoid, is passed on the command line via `--dataset`, not in the CSV.)

### Log File Format

Create a detailed log showing:

- Each location processed
- Match status (existing or new)
- File counts and metadata
- Summary statistics
- Next steps

```python
LOG_FILE = BASE_DIR / "import_log.txt"

with open(LOG_FILE, 'w') as log_file:
    log(f"Processing: {folder_name}", log_file)
    if matched:
        log(f"  ✓ MATCH: {folder} → {existing} (ID: {id})", log_file)
    else:
        log(f"  + NEW: {folder} (will create)", log_file)
    log(f"  Files: {count}, Sample Rate: {rate} Hz, Date: {date}", log_file)
```

## Step 4: Check with User Before Creating Locations

**CRITICAL**: Never create locations without user confirmation!

Present a summary and ask:

```
PLAN READY - AWAITING CONFIRMATION
===================================
Matched existing: 17
Need to create: 5
Total files: 8,003

Locations to be created:
- new-location-1 (420 files)
- new-location-2 (350 files)
...

Shall I proceed to create these 5 new locations?
```

**Wait for user approval** before executing any location creation!

## Step 5: Create Locations (If Confirmed)

### GPS Parsing (If Available)

```python
def parse_gps_from_log(log_file: Path) -> Tuple[float, float, int]:
    """Parse and average GPS coordinates from a log file.

    IMPORTANT: Use errors='ignore' to handle encoding issues!
    Some log files contain non-UTF-8 bytes that will crash the read without it.
    """
    if not log_file.exists():
        return None, None, 0

    gps_pattern = re.compile(r'GPS \(lat,long\): (-?\d+\.\d+),(-?\d+\.\d+)')
    latitudes = []
    longitudes = []

    # CRITICAL: errors='ignore' prevents UnicodeDecodeError
    with open(log_file, 'r', encoding='utf-8', errors='ignore') as f:
        for line in f:
            match = gps_pattern.search(line)
            if match:
                latitudes.append(float(match.group(1)))
                longitudes.append(float(match.group(2)))

    if latitudes:
        avg_lat = sum(latitudes) / len(latitudes)
        avg_lon = sum(longitudes) / len(longitudes)
        return (avg_lat, avg_lon, len(latitudes))
    return None, None, 0
```

### Location Creation

```python
# Use the CLI tool (not MCP) for scripting
cmd = [
    "./skraak", "create", "location",
    "--db", db_path,
    "--dataset", dataset_id,
    "--name", location_name,
    "--lat", str(latitude),
    "--lon", str(longitude),
    "--timezone", "Pacific/Auckland",  # NZ default
    "--description", description,
]
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
response = json.loads(result.stdout)
location_id = response["location"]["id"]
```

### Description Format

**With GPS:**

```python
description = f"Averaged from {fix_count} GPS readings"
# Example: "Averaged from 14 GPS readings"
```

**Without GPS (fall back to Takaka for Kahurangi data, Te Anau for Fiordland data):**

```python
description = "Location assumed to be Takaka pending GPS data"
# Note: Takaka coords are -40.85085, 172.80703
```

**Generic (user preference - if provided):**

```python
description = "Friends of Cobb - Assumed location (Takaka) - no GPS data available"
```

## Step 6: File Metadata Analysis

### Count WAV Files (RECURSIVE!)

**CRITICAL: Always count recursively!** Files may be organized in daily subfolders.

```python
def count_wav_files(folder_path: str) -> int:
    """Count WAV files recursively (searches all subdirectories).

    IMPORTANT: Don't use glob("*.wav") - it only searches the top level!
    Files are often organized in daily subfolders:
        location/
            2025-03-08/
                file1.wav
                file2.wav
            2025-03-09/
                file3.wav
    """
    count = 0
    for root, dirs, files in os.walk(folder_path):
        for file in files:
            if file.lower().endswith('.wav'):
                count += 1
    return count
```

### Extract Sample Rate (RECURSIVE!)

```python
def get_sample_rate(folder_path: str) -> int:
    """Get the sample rate from the first WAV file (searches recursively).

    IMPORTANT: Search recursively since files may be in subfolders!
    """
    for root, dirs, files in os.walk(folder_path):
        for file in files:
            if file.lower().endswith('.wav'):
                file_path = os.path.join(root, file)
                # Read the WAV header (sample rate at bytes 24-27)
                try:
                    with open(file_path, 'rb') as f:
                        f.seek(24)
                        return int.from_bytes(f.read(4), byteorder='little')
                except OSError:
                    pass
    return 8000  # Default fallback
```

### Parse Date Range (RECURSIVE!)

**Support both formats:**

```python
def extract_date_from_filename(filename: str) -> Optional[str]:
    """Extract the date from a filename like mok_bl6P_20250308_200003.wav"""
    match = re.search(r'(\d{8})_', filename)
    if match:
        date_str = match.group(1)
        # Convert YYYYMMDD to YYYY-MM
        return f"{date_str[:4]}-{date_str[4:6]}"
    return None

def determine_date_range(folder_path: str) -> str:
    """Determine the date range from WAV filenames (searches recursively).

    IMPORTANT: Search recursively since files may be in daily subfolders!
    """
    dates = set()
    for root, dirs, files in os.walk(folder_path):
        for file in files:
            if file.lower().endswith('.wav'):
                date = extract_date_from_filename(file)
                if date:
                    dates.add(date)
    if dates:
        sorted_dates = sorted(dates)
        if len(sorted_dates) == 1:
            return sorted_dates[0]
        return f"{sorted_dates[0]} to {sorted_dates[-1]}"
    return "unknown"
```

## Step 7: Update CSV and Verify

After creating locations, update the CSV with real IDs:

```python
# Update TO_BE_CREATED → real location IDs
# Regenerate the CSV with all real IDs
```

### Verification Queries

```sql
-- Check all locations created
SELECT id, name, latitude, longitude, SUBSTRING(description, 1, 50)
FROM location
WHERE dataset_id = 'DATASET_ID'
  AND name LIKE 'PREFIX%'
  AND active = true
ORDER BY name;

-- Verify the row count matches expected
```

### CSV Verification

```bash
# Should be N+1 rows (header + N locations)
wc -l import.csv

# Sum file counts
cat import.csv | tail -n +2 | cut -d',' -f6 | paste -sd+ | bc
```

## Database Safety

### Use Backup for Queries if Production Locked

```python
# Query the backup database for existing locations
# (check for the latest backup; expect something like
#  "/home/david/go/src/skraak/db/backup_YYYY-MM-DD_X.duckdb")
QUERY_DB = "/home/david/go/src/skraak/db/backup_YYYY-MM-DD_X.duckdb"

# Create locations in production (when unlocked)
PROD_DB = "/home/david/go/src/skraak/db/skraak.duckdb"
```

### Always Default to Test Database

⚠️ **Unless the user explicitly specifies production:**

```python
DB_PATH = "/home/david/go/src/skraak/db/test.duckdb"  # Default
```

## Example Usage

### Example 1: All Locations Already Exist

**User Request:**

```
Setup bulk import for Friends of Cobb dataset:
- Dataset ID: QZ0tlUrX4Nyi
- 17 location folders
- Use Takaka coordinates (-40.85085, 172.80703)
- Generate CSV at: /path/to/foc_import.csv
```

**Your Response:**

1. Query existing locations (17 found)
2. Fuzzy match all 17 folders → all matched!
3. Generate CSV with real location IDs
4. Report: "All locations exist, CSV ready for import"
5. Provide the bulk import command

### Example 2: Create New Locations (MOK Import)

**User Request:**

```
Import Manu o Kahurangi kiwi survey:
- 23 location folders
- 8 have GPS log files
- 15 need Takaka default coords
- Generate CSV at: /path/to/mok_import.csv
```

**Your Workflow:**

1. **Parse GPS logs** (8 locations):

```python
GPS_DATA = {
    "mok_bl6P_Low": {"lat": -40.823070, "lon": 172.582703, "fixes": 14},
    "mok_bl6T_Low": {"lat": -40.797182, "lon": 172.594582, "fixes": 14},
    ...
}
```

2. **Query existing locations** with strict fuzzy matching (0.90 threshold):
   - Result: no matches found (all new locations)
3. **Count files recursively** (files in daily subfolders):
   - Total: 15,212 WAV files across 23 locations
4. **Present the plan to the user**:

```
PLAN: Create 23 new locations
- 8 with GPS coordinates (averaged from log files)
- 15 with Takaka defaults
Total files: 15,212
Proceed? [yes/no]
```
5. **After approval, create all locations**:

```python
for location in locations:
    if location in GPS_DATA:
        gps = GPS_DATA[location]
        description = f"Averaged from {gps['fixes']} GPS readings"
    else:
        gps = {'lat': DEFAULT_LAT, 'lon': DEFAULT_LON}
        description = "Location assumed to be Takaka pending GPS data"
    location_id = create_location(location, gps['lat'], gps['lon'], description)
```

6. **Generate the CSV** with real location IDs
7. **Verify**:
   - Query the database to confirm all 23 locations were created
   - Verify the CSV has 23 rows + header
   - Provide the bulk import command

## Output Summary Template

```
======================================================================
BULK IMPORT SETUP COMPLETE
======================================================================
Dataset: Dataset Name (ID: abc123)
Database: /path/to/db

Locations processed: 17/17
- Existing (matched): 17
- New (created): 0
- Failed: 0

Total files: 8,003

CSV Generated: /path/to/import.csv
Log file: /path/to/import_log.txt

Ready for bulk import!
Run:
cd /home/david/go/src/skraak
./skraak import bulk --db ./db/skraak.duckdb \
  --dataset DATASET_ID \
  --csv "/path/to/import.csv" \
  --log /path/to/bulk_import.log
```

## All-in-One Python Script Template

For complex imports with many locations, create a comprehensive Python script:

```python
#!/usr/bin/env python3
"""Bulk import setup: Parse GPS, create locations, generate CSV"""
import os
import re
import json
import subprocess
from pathlib import Path
from difflib import SequenceMatcher
from typing import Dict, List, Optional, Tuple

# Configuration
DATASET_ID = "abc123xyz789"
DB_PATH = "./db/skraak.duckdb"
TIMEZONE = "Pacific/Auckland"
BASE_PATH = "/path/to/data"
DEFAULT_LAT = -40.85085
DEFAULT_LON = 172.80703

# GPS log files (if available)
GPS_LOGS = ["location1/log.txt", "location2/log.txt"]

# Location folders
FOLDERS = ["location1", "location2", ...]

def parse_gps_log(log_path: str) -> Optional[Tuple[float, float, int]]:
    """Parse a GPS log and return (lat, lon, fix_count)"""
    if not os.path.exists(log_path):
        return None
    gps_pattern = re.compile(r'GPS \(lat,long\): (-?\d+\.\d+),(-?\d+\.\d+)')
    coords = []
    with open(log_path, 'r', encoding='utf-8', errors='ignore') as f:
        for line in f:
            match = gps_pattern.search(line)
            if match:
                coords.append((float(match.group(1)), float(match.group(2))))
    if coords:
        avg_lat = sum(c[0] for c in coords) / len(coords)
        avg_lon = sum(c[1] for c in coords) / len(coords)
        return (avg_lat, avg_lon, len(coords))
    return None

def count_wav_files_recursive(folder_path: str) -> int:
    """Count WAV files recursively"""
    count = 0
    for root, dirs, files in os.walk(folder_path):
        count += sum(1 for f in files if f.lower().endswith('.wav'))
    return count

def get_sample_rate_recursive(folder_path: str) -> int:
    """Get the sample rate from the first WAV file found"""
    for root, dirs, files in os.walk(folder_path):
        for file in files:
            if file.lower().endswith('.wav'):
                try:
                    with open(os.path.join(root, file), 'rb') as f:
                        f.seek(24)
                        return int.from_bytes(f.read(4), byteorder='little')
                except OSError:
                    pass
    return 8000

def create_location(name: str, lat: float, lon: float, desc: str) -> str:
    """Create a location via the CLI and return its ID"""
    cmd = [
        "./skraak", "create", "location",
        "--db", DB_PATH,
        "--dataset", DATASET_ID,
        "--name", name,
        "--lat", str(lat),
        "--lon", str(lon),
        "--timezone", TIMEZONE,
        "--description", desc,
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)['location']['id']

def main():
    # Parse GPS logs
    gps_data = {}
    for log_file in GPS_LOGS:
        result = parse_gps_log(os.path.join(BASE_PATH, log_file))
        if result:
            folder = log_file.split('/')[0]
            gps_data[folder] = {'lat': result[0], 'lon': result[1], 'fixes': result[2]}

    # Process locations
    location_map = {}
    for folder in FOLDERS:
        folder_path = os.path.join(BASE_PATH, folder)

        # Determine coordinates and description
        if folder in gps_data:
            lat = gps_data[folder]['lat']
            lon = gps_data[folder]['lon']
            desc = f"Averaged from {gps_data[folder]['fixes']} GPS readings"
        else:
            lat = DEFAULT_LAT
            lon = DEFAULT_LON
            desc = "Location assumed to be Takaka pending GPS data"

        # Create location
        location_id = create_location(folder, lat, lon, desc)

        # Get metadata
        file_count = count_wav_files_recursive(folder_path)
        sample_rate = get_sample_rate_recursive(folder_path)
        location_map[folder] = {
            'id': location_id,
            'files': file_count,
            'sample_rate': sample_rate,
        }

    # Generate CSV (use manual writes, not the csv module, to avoid \r\n line endings)
    csv_path = os.path.join(BASE_PATH, "import.csv")
    with open(csv_path, 'w') as f:
        f.write("location_name,location_id,directory_path,date_range,sample_rate,file_count\n")
        for folder, info in location_map.items():
            f.write(f"{folder},{info['id']},{BASE_PATH}/{folder},2025-02,{info['sample_rate']},{info['files']}\n")

    print(f"✓ Created {len(location_map)} locations")
    print(f"✓ Generated CSV: {csv_path}")

if __name__ == '__main__':
    main()
```

## Common Patterns

### Date Format Detection

```python
# YYYYMMDD: 20231115_050004
# DDMMYY:   011123_050004
# YYMMDD:   231101_050004

# Heuristic: if the first 4 chars parse to > 2000, it is likely YYYYMMDD.
# Variance can disambiguate YYMMDD vs DDMMYY (many distinct days, few distinct years).
```

### Multiple Sample Rates

```python
# Don't assume uniform rates - check each location
# Common: 8000 Hz, 16000 Hz, 32000 Hz, 250000 Hz
```

### Trip References in Names

```python
# Location names may include trip references:
# foc-thup-foc01 → Friends of Cobb, trip 1
# foc-thup-fof11 → Friends of Flora, trip 11
```

## Troubleshooting

### Production Database Locked

**Solution**: Use the backup database for queries, create locations later:

```python
# Step 1: Query the backup for matches
# Step 2: Prepare CSV with analysis
# Step 3: Check with the user
# Step 4: Wait for the prod unlock
# Step 5: Check prod agrees with the backup
# Step 6: Create locations in prod
# Step 7: Update CSV with real IDs
```

### Ambiguous Fuzzy Matches

**Problem**: One database location matches multiple folders

**Solution**: Be conservative - create a new location for each folder

```python
# foc-thup-f could match:
# - foc-thup-foc01
# - foc-thup-fof11
# Create a new location for each folder instead
```

### Zero-Byte Files

**Handling**: Count them
but note them in the logs:

```python
wav_files = list(directory.rglob("*.wav"))  # recursive, per the lesson below
file_count = len(wav_files)
zero_byte_files = [f for f in wav_files if f.stat().st_size == 0]
if zero_byte_files:
    log(f"  ⚠ {len(zero_byte_files)} zero-byte files found", log_file)
```

## Key Lessons (From Production Imports)

### 🔴 CRITICAL: Stricter Fuzzy Matching (0.90 threshold)

**Problem:** A threshold of 0.80 caused false matches:

- `mok_bl6P` matched to `mok_bl16` (score: 0.86) → WRONG! Different locations
- `mok_tuna51` matched to `mok_tuna31` (score: 0.89) → WRONG! Different locations

**Solution:** Use a 0.90 threshold:

```python
if best_score > 0.90:  # NOT 0.80!
    return best_match
```

### 🔴 CRITICAL: Always Search Recursively

**Problem:** Using `glob("*.wav")` missed files in subdirectories.

**Solution:** Use `os.walk()` everywhere:

```python
for root, dirs, files in os.walk(folder_path):  # NOT glob()!
    for file in files:
        if file.lower().endswith('.wav'):
            pass  # process file here
```

**Applies to:**

- File counting
- Sample rate detection
- Date range parsing

### 🔴 CRITICAL: Handle Encoding Errors

**Problem:** Some GPS log files contain non-UTF-8 bytes → `UnicodeDecodeError`

**Solution:** Always use `errors='ignore'`:

```python
with open(log_file, 'r', encoding='utf-8', errors='ignore') as f:  # errors='ignore'!
    ...
```

### 🟡 IMPORTANT: Remove Suffixes in Normalization

Normalize location names to handle suffix variations:

```python
name = name.lower()
name = re.sub(r'[_\-\s]+', '', name)     # strip separators first
name = re.sub(r'(low|high)$', '', name)  # then remove Low/High suffix
# mok_bl6P_Low → mokbl6p (for matching)
```

### 🟡 IMPORTANT: All-in-One Python Script

For complex imports (10+ locations), create a single Python script that:

1. Parses GPS logs
2. Counts files recursively
3. Gets metadata
4. Creates locations via the CLI
5. Generates the CSV

**Benefits:**

- Single entry point (everything in one script)
- Easy to debug
- Reusable for similar imports
- Complete audit trail

## Guidelines

- Always check existing locations before creating new ones (avoid duplicates)
- Use **strict** fuzzy matching (0.90 threshold) to avoid false matches
- **Always search recursively** with `os.walk()` (files may be in subfolders!)
- Prepare the CSV before creating locations (allows user review)
- Never create locations without explicit user confirmation
- Write comprehensive logs for debugging and records
- Support multiple date formats (YYYYMMDD, DDMMYY, and YYMMDD)
- Handle missing GPS gracefully with fallback coordinates
- **Handle encoding errors** with `errors='ignore'` when reading logs
- Verify results with SQL queries and CSV checks
- Provide a ready-to-run bulk import command

## Success Criteria

✅ No duplicate locations created
✅ All folders matched or new locations planned
✅ CSV has the correct format with all required columns
✅ Log file provides an audit trail
✅ User confirmed before any database writes
✅ Verification queries show expected results
✅ Bulk import command provided
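The variance heuristic mentioned under Common Patterns (many distinct days, few distinct years) can be sketched as follows. This is an illustrative helper, not part of the skraak CLI; `guess_date_format` and its return strings are assumptions for the sketch:

```python
import re
from typing import List, Optional

def guess_date_format(stems: List[str]) -> Optional[str]:
    """Guess the date convention used in a batch of recorder filenames.

    Heuristic (from 'Common Patterns'): an 8-digit token whose first 4 digits
    exceed 2000 is YYYYMMDD; for 6-digit tokens, the 2-digit field with more
    distinct values across many files is the day (deployments span many days,
    but rarely more than one year).
    """
    tokens = [m.group(1) for s in stems if (m := re.search(r'(\d{6,8})_', s))]
    eight = [t for t in tokens if len(t) == 8]
    six = [t for t in tokens if len(t) == 6]
    if eight and all(int(t[:4]) > 2000 for t in eight):
        return "YYYYMMDD"
    if six:
        first = {t[:2] for t in six}  # distinct values of the leading field
        last = {t[4:] for t in six}   # distinct values of the trailing field
        return "DDMMYY" if len(first) > len(last) else "YYMMDD"
    return None
```

With only one or two files the day/year fields may tie, so treat the guess as a prompt for manual confirmation rather than a final answer.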
---

Note skill: skraak-bulk-import-setup

# Bulk Import

**Dataset:** [name]
**Base folder:** [path]
**WAV files:** [count]
**Folders:** [count]
**Ignore:** Clips_*
**Recursive:** yes/no

## Folders

[list]

## Log files (GPS)

[list, or "none - assume Takaka -40.85085, 172.80703"]

Note if multiple logs per folder (average coords).

## Output paths

- CSV: [path]
- Log: [path]
- Python: save to base folder

## Import command

```bash
./skraak import bulk --db ./db/skraak.duckdb --dataset [id] --csv "[csv]" --log "[log]"
```