Saddleback_Sth_Is saddle3 x
Silvereye silver3 x
South Island Oystercatcher soioys1 x
South Island Wren soiwre1 -
Thrush_Song sonthr1 x
Tomtit tomtit1 x
Tui tui1 x
Warbler_Grey gryger1 x
Weka_spp weka1 x
Yellowhammer yellow2 x
function flatten(signal, freq)
    if ndims(signal) > 1
        signal = vec(mean(signal, dims=2))
    end
    # signal = DSP.resample(signal, 16000.0f0 / freq)
    signal = vec(signal) # ensure Vector output
    # freq = 16000
    return signal, freq
end

for item in a
    FLAC.load(FileIO.query(item)) |>
        x -> flatten(x[1], x[2]) |>
        x -> get_image_from_sample(x[1], x[2]) |>
        img -> save(replace(item, "flac" => "png"), img)
end
---
name: propagate-call-classification
description: Auto-classify segments in one filter based on verified classifications from another filter. Use when one ML filter has been reviewed and you want to propagate those classifications to another filter on overlapping segments.
---

# Propagate Call Classification

Propagate verified classifications from one filter to another for overlapping segments. This is useful when you have reviewed classifications from one ML model and want to apply those learnings to another model's segments.

## When to Use

- You have reviewed/classified segments for one filter (e.g., `opensoundscape-kiwi-1.2`)
- You want to auto-classify overlapping segments in another filter (e.g., `opensoundscape-kiwi-1.5`)
- Both filters have segments on the same audio files

## Prerequisites

- Folder with `.data` files containing segments from multiple filters
- One filter already reviewed (certainty >= 80 for confirmed calls, certainty = 0 for confirmed noise)
- `./skraak` binary built and available

## Workflow

### Step 1: Identify candidate segments

Run summarise and use jq to find overlapping segments where:

- Source filter has verified classification (certainty >= 90 or = 0)
- Target filter has unverified classification (certainty = 70)
- Segments overlap in time
- Species/calltype differ between filters

```bash
./skraak calls summarise --folder "<folder_path>" > /tmp/summary.json
cat /tmp/summary.json | jq '
  .segments |
  group_by(.file) |
  map({file: .[0].file,
       segments: [.[] | {start: .start_time, end: .end_time, labels: .labels}]}) |
  map({
    file: .file,
    auto_classify: [
      .segments[] as $s1 |
      .segments[] as $s2 |
      select(
        ($s1.labels[0].filter == "<source_filter>") and
        ($s2.labels[0].filter == "<target_filter>") and
        # Time overlap
        ($s1.start | floor) <= ($s2.end | ceil) and
        ($s2.start | floor) <= ($s1.end | ceil) and
        # Source is verified
        (($s1.labels[0].certainty >= 90) or ($s1.labels[0].certainty == 0)) and
        # Target is unverified
        ($s2.labels[0].certainty == 70) and
        # They differ
        (($s2.labels[0].species != $s1.labels[0].species) or
         ($s2.labels[0].calltype != $s1.labels[0].calltype))
      ) |
      {
        file: .file,
        source_start: $s1.start,
        source_end: $s1.end,
        source_species: $s1.labels[0].species,
        source_calltype: $s1.labels[0].calltype,
        source_certainty: $s1.labels[0].certainty,
        target_start: $s2.start,
        target_end: $s2.end,
        target_species: $s2.labels[0].species,
        target_calltype: $s2.labels[0].calltype,
        action: (if $s1.labels[0].certainty == 0 then "Noise"
                 else $s1.labels[0].species + "+" + ($s1.labels[0].calltype // "N/A") end)
      }
    ]
  }) |
  map(select(.auto_classify | length > 0)) |
  [.[] | .auto_classify[]]
'
```

### Step 2: Present summary to user

Group results by action type:

| Action | Count | Description |
|--------|-------|-------------|
| GSK+Male from "Don't Know" | N | Target had "Don't Know", source verified GSK Male |
| GSK+Female from "Don't Know" | N | Target had "Don't Know", source verified GSK Female |
| GSK+Duet from "Don't Know" | N | Target had "Don't Know", source verified GSK Duet |
| Calltype correction | N | Both had GSK but different calltypes |
| Noise from "Don't Know" | N | Source confirmed as noise (certainty=0) |
| Weka from "Don't Know" | N | Source identified as Weka |
| Other species | N | Other verified species |

Ask user to confirm before applying changes.

### Step 3: Apply approved changes

Use `skraak calls modify` to update target filter segments:

```bash
# For species+calltype changes
./skraak calls modify --file "<folder>/<file>.wav.data" --reviewer Claude \
  --filter <target_filter> --segment <start>-<end> \
  --species <Species>+<Calltype> --certainty 90

# For noise
./skraak calls modify --file "<folder>/<file>.wav.data" --reviewer Claude \
  --filter <target_filter> --segment <start>-<end> \
  --species Noise --certainty 90

# For species without calltype (Weka, etc.)
./skraak calls modify --file "<folder>/<file>.wav.data" --reviewer Claude \
  --filter <target_filter> --segment <start>-<end> \
  --species <Species> --certainty 90
```

### Step 4: Verify results

```bash
./skraak calls summarise --folder "<folder_path>" --brief
```

Check that both filters now have similar species counts and the "Don't Know" count decreased in the target filter.

## Classification Rules

- **Reviewer**: always "Claude" (indicates LLM-assisted classification)
- **Always use certainty=90** for auto-classified segments (distinguishes from ML=70, human=100)
- **Only modify** segments where target has certainty=70 (unverified)
- **Never modify** segments where target already has certainty >= 80
- **Handle noise carefully**: source certainty=0 means confirmed noise, propagate as Noise species

## Segment Matching

Segments overlap when:

```
floor(source.start) <= ceil(target.end) AND floor(target.start) <= ceil(source.end)
```

The `--segment` parameter uses integer seconds, so floor/ceil handles the matching:

- Source: 362.5-390 matches target: 360-390 via `--segment 362-390`
- Source: 297.5-322.5 matches target: 295-323 via `--segment 297-323`

## Example Output

```
Auto-Classification Candidates for opensoundscape-kiwi-1.5

Based on verified opensoundscape-kiwi-1.2 classifications:

### GSK from "Don't Know" (14 segments)
| File | Segment | Action |
|------|---------|--------|
| ridgeend_20260303_204507 | 297-323 | → GSK+Female |
| ridgeend_20260303_211507 | 292-320 | → GSK+Female |
...

### Calltype Corrections (4 segments)
| File | Segment | Current | Correct |
|------|---------|---------|---------|
| ridgeend_20260304_043007 | 360-388 | GSK+Duet | → GSK+Male |
...

Summary: 27 segments ready for auto-classification

Apply these changes? (y/n)
```

## Notes

- This is a one-way propagation: source filter → target filter
- Run `skraak-check-call-classification` skill first to verify the source filter
- The target filter may have unique segments not in source - these remain unverified
- Species that sound similar (e.g., GSK and LSK) may need human review - flag these for user attention
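The floor/ceil overlap rule can be sketched in Python (a hypothetical helper for illustration, not part of skraak):

```python
import math

def segments_overlap(src_start, src_end, tgt_start, tgt_end):
    """Integer-second overlap test, matching the jq floor/ceil conditions."""
    return (math.floor(src_start) <= math.ceil(tgt_end)
            and math.floor(tgt_start) <= math.ceil(src_end))

def segment_arg(start, end):
    """Render a --segment value from fractional-second boundaries."""
    return f"{math.floor(start)}-{math.ceil(end)}"

# Worked examples from the matching rules above
print(segments_overlap(362.5, 390, 360, 390))  # True
print(segment_arg(362.5, 390))                 # 362-390
print(segment_arg(297.5, 322.5))               # 297-323
```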
---
name: check-call-classification
description: Check ML-classified bird call spectrograms by visual inspection, confirming or correcting calltype labels in .data files
---

# Check Call Classification

Review ML-classified bird call segments by visually inspecting spectrogram images. Confirm correct classifications or reclassify incorrect ones. This skill can only **upgrade** certainty (e.g. 70 -> 90), never downgrade it.

## When to Use

When the user wants to review/check ML calltype classifications on bird call segments. Typical use cases:

- Check all GSK (Great Spotted Kiwi / Roroa) calltypes in a folder
- Review a specific calltype (e.g. just GSK+Male)
- Check "Don't Know" segments for any missed kiwi calls
- Review segments from a specific ML filter

## Prerequisites

- Folder with `.data` files
- Corresponding `.wav` files present (needed for clip generation)
- `./skraak` binary built and available

## Workflow

### Step 1: Scope the review

Ask the user for:

- **Folder path** containing `.data` files
- **Which filter** to check (e.g. `opensoundscape-kiwi-1.2`)
- **Which species+calltypes** to review (e.g. `GSK`, `GSK+Male`, `Don't Know`)

Run summarise to show what's available:

```bash
./skraak calls summarise --folder "<folder_path>" --brief > /tmp/summarise.json 2>&1
```

Show the user the filters, species, calltypes, and review status. Confirm:

- The exact species labels to use for modifications (e.g. `GSK+Male`, `GSK+Female`, `GSK+Duet`, `Noise`, `Don't Know`)
- Whether to skip certainty 100 segments (default: yes, these are human-reviewed)

### Step 2: Load reference spectrograms

Look for reference examples in this skill's folder. Map species to reference folder:

- GSK / Great Spotted Kiwi -> `Roroa/`
- (Future species get their own folders)

The folder structure is `<common_name>/<calltype>/` containing paired `.png` and `.wav` example files.

Read 3 representative PNG images from each calltype subfolder (pick files evenly spaced through the sorted file list). Study the visual patterns:

**GSK/Roroa calltypes:**

- **Male Solo**: bold, regularly spaced vertical bars at lower frequency, strong signal
- **Female Solo**: thinner, higher-frequency bars, often fainter/more delicate
- **Duet**: overlapping male + female patterns visible simultaneously, two distinct frequency bands

**Noise/False positive**: no clear repeated bar pattern, random speckle or broadband noise

**Note on Frequency Scaling:** Spectrogram frequency ranges may vary based on recording equipment (8kHz/16kHz/32kHz). Focus on the **relative pattern** (spacing, thickness, regularity of vertical bars) rather than absolute frequency position. Male calls show bold, regularly-spaced bars at lower frequencies relative to the display; female calls are thinner and higher relatively.

If the sample rate is above 16kHz, audio is downsampled to 16kHz; if the sample rate is below 16kHz, it is left as is for the spectrogram. In practice images will be 8kHz (most common) or 16kHz.

### Step 3: Generate clips for the folder

Generate all clips at once using the folder mode:

```bash
./skraak calls clip --folder "<folder_path>" --prefix <prefix> --output /tmp/<prefix>/ \
  --filter <filter> --species <species> --size 224 --color
```

Use `--species GSK` (no calltype) to get all calltypes, or `--species GSK+Male` for a specific calltype.

**Before reading any images**, report the number of spectrograms generated and ask the user to confirm they are OK with the token cost. Each 224px PNG is one image token. Give the user the option to proceed, reduce scope (e.g. filter to a specific calltype), or abort.

### Token Efficiency Guidelines

- **<100 segments**: Review all at once
- **100-300 segments**: Review in one batch (acceptable cost)
- **>300 segments**: Consider filtering by calltype (e.g., review Duets/Females first, then Males) or by date range to manage token costs

### Step 4: Examine spectrograms

Read each generated `.png` file using the Read tool. Compare against reference images from Step 2.

For each segment, determine:

- **Confirm**: classification is correct -> upgrade certainty to 90
- **Reclassify**: wrong calltype -> modify to correct species+calltype at certainty 90
- **Skip**: too faint or uncertain -> leave as-is at current certainty (never downgrade)

### When to Skip (Leave at certainty 70)

- **Duration <15 samples** (very short segments)
- **Signal-to-noise ratio too low** (fainter than reference examples)
- **Fragmented/incomplete patterns** (file boundary artifacts)
- **Uncertain classification** (pattern doesn't clearly match reference)
- **Never guess** - skip rather than misclassify

Parse segment start-end times from the clip filename: `<prefix>_<basename>_<START>_<END>.png`

Map back to the source .data file using the basename (e.g. `check_tx45_20260304_203005_310_345.png` -> `tx45_20260304_203005.wav.data`, segment `310-345`).

### Step 5: Present batch summary

Present results in a table:

```
Folder: tx45/202603 march (filter: opensoundscape-kiwi-1.2, species: GSK)

#  | File                 | Segment | Current Label  | Action         | Notes
---|----------------------|---------|----------------|----------------|------
1  | tx45_20260304_204504 | 132-163 | GSK+Male (70)  | Confirm (90)   | Clear male pattern
2  | tx45_20260305_040003 | 352-395 | GSK+Male (70)  | -> GSK+Duet(90)| Female visible too
3  | tx45_20260304_211504 | 442-463 | GSK+Male (70)  | Skip           | Too faint to confirm
```

**Ask user to approve** before executing any modifications. User can override individual decisions.

### Step 6: Execute approved changes

Confirm correct classification (bump certainty only):

```bash
./skraak calls modify --file "<folder>/<basename>.wav.data" --reviewer <Your Name> \
  --filter <filter> --segment <start>-<end> --certainty 90
```

Reclassify to different calltype:

```bash
./skraak calls modify --file "<folder>/<basename>.wav.data" --reviewer <Your Name> \
  --filter <filter> --segment <start>-<end> --species GSK+Duet --certainty 90
```

Mark as noise (only if clearly not a bird call):

```bash
./skraak calls modify --file "<folder>/<basename>.wav.data" --reviewer <Your Name> \
  --filter <filter> --segment <start>-<end> --species Noise --certainty 90
```

Bookmark an outstanding example:

```bash
./skraak calls modify --file "<folder>/<basename>.wav.data" --reviewer <Your Name> \
  --filter <filter> --segment <start>-<end> --certainty 90 --bookmark
```

Add a comment to preserve information:

```bash
./skraak calls modify --file "<folder>/<basename>.wav.data" --reviewer <Your Name> \
  --filter <filter> --segment <start>-<end> --certainty 90 --comment "Clear duet with visible female component"
```

**Bookmark**: Use `--bookmark` to mark outstanding, clear, or beautiful examples of a call. Bookmarked segments can be easily found later for training reference or demonstration purposes.

**Comment**: Use `--comment "text"` to preserve useful information about a segment. Comments are stored in the .data file and visible when classifying calls manually. Max 140 characters, ASCII only. Examples:

- "Clear duet with visible female component"
- "Unusual timing pattern, verify with audio"
- "Good training example for male vs female distinction"

### Step 7: Verify Changes (Optional)

Check that modifications were applied correctly:

```bash
# Count segments at certainty 90
grep -r '"certainty": 90' <folder>/*.data | wc -l

# View a specific modified file
cat <folder>/<filename>.wav.data | grep -A5 '"segment_start":<start>'
```

### Step 8: Final summary

```
Review Complete

Folder: <path>
Filter: <filter>, Species: <species>
Segments reviewed: N
Confirmed (-> 90): A (%)
Reclassified: B (%)
Skipped (left as-is): C (%)
```

## Classification Rules

- **Reviewer**: always the actual model being used, eg haiku-4.5, kimi-k2.5, glm-5, minimax-m2.7
- **Never modify** segments with certainty >= 90 (already reviewed by LLM or human)
- **Only upgrade certainty**: confirm (70 -> 90) or reclassify (-> 90). Never downgrade. If uncertain, skip the segment and leave it as-is.
- **Certainty 90**: use for all LLM classifications (distinguishes from ML=70, human=100)
- **Always ask** the user for labels if you see something identifiable that isn't in the expected species list
- Use `calls summarise --brief` to check existing labels in the folder

## Notes

- Image size defaults to 224px with `--color`. Use `--size 448` if more detail needed (higher token cost)
- Always report clip count and ask user before reading images (token cost awareness)
- Distant/faint calls are common - skip rather than guess
- The ML models can be ~97% accurate, so most classifications will often be correct. The goal is to increase certainty on correct ones, not to second-guess the model.
- The `.data.bak` files serve as backups - never modify them

## Calltype-Specific Tips

### GSK/Roroa Specific Guidance

- **Male**: Most common, look for bold vertical stripes at ~1/3 to 1/2 of spectrogram height
- **Female**: Higher frequency, thinner bars, often fainter - don't mistake faintness for absence
- **Duet**: Both patterns present simultaneously - if you see ANY female component with male, it's a Duet
- **Distant calls**: Fainter versions of same patterns are still valid - only skip if pattern is unclear

### Reference Image Strategy

1. Read 2-3 reference images from EACH calltype folder
2. Mentally note the **key distinguishing features**:
   - Male: Bar thickness, spacing regularity, relative frequency position
   - Female: Thin bars, higher position, delicate appearance
   - Duet: Two distinct frequency bands with overlapping patterns
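The filename round-trip described in Step 4 can be sketched as follows (a hypothetical helper for illustration; skraak does not ship this function):

```python
import re

def parse_clip_filename(name):
    """Split '<prefix>_<basename>_<START>_<END>.png' into its parts.

    The basename itself contains underscores, so anchor on the two
    trailing integer fields rather than splitting on every underscore.
    """
    m = re.match(r"^([^_]+)_(.+)_(\d+)_(\d+)\.png$", name)
    if not m:
        raise ValueError(f"unexpected clip filename: {name}")
    prefix, basename, start, end = m.groups()
    return {
        "data_file": f"{basename}.wav.data",
        "segment": f"{start}-{end}",
    }

# Example from Step 4
print(parse_clip_filename("check_tx45_20260304_203005_310_345.png"))
# {'data_file': 'tx45_20260304_203005.wav.data', 'segment': '310-345'}
```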
---
name: import-segments
description: Import classified segments from a single folder of .data files into the database - looks up IDs, creates mapping, runs import
---

# Import Segments from a Single Folder

Import manually classified AviaNZ `.data` file segments into the database for one folder (one cluster).

## When to Use

When the user wants to import classified `.data` files from a folder. The folder must already have `.data` files (from `calls classify` or `calls from-preds`) and the WAV files must already be imported into the database.

## Prerequisites

Before this skill runs, the user should have already:

1. Generated predictions and created `.data` files (`calls from-preds` or similar)
2. Classified calls (`calls classify`)
3. Imported the WAV files into the database (`import folder` or `import bulk`)

## Workflow

### Step 1: Identify the folder

Get the folder path from the user. Confirm `.data` files exist:

```bash
ls <folder>/*.data | head -5
```

### Step 2: Determine dataset, location, and cluster IDs

Extract the cluster/location name from the folder path (e.g., `tx51` from `.../tx51/`). Query the database to find matching IDs:

```bash
./skraak sql --db ./db/skraak.duckdb "SELECT c.id AS cluster_id, c.name AS cluster_name, l.id AS location_id, l.name AS location_name, d.id AS dataset_id, d.name AS dataset_name FROM cluster c JOIN location l ON c.location_id = l.id JOIN dataset d ON l.dataset_id = d.id WHERE c.active = true AND c.name ILIKE '%tx51%'"
```

Adjust the `ILIKE` pattern based on the folder name. If multiple results, ask the user to pick. Show the user the matched dataset/location/cluster and confirm before proceeding.

### Step 3: Check mapping file

Look for an existing `mapping_*.json` in the folder:

```bash
ls <folder>/mapping_*.json
```

If no mapping exists, tell the user to run the `skraak-data-mapping` skill first:

```
No mapping file found. Please run: /skraak-data-mapping
```

Then stop and wait for the user.

### Step 4: Run the import

```bash
./skraak import segments \
  --db ./db/skraak.duckdb \
  --dataset <dataset_id> \
  --location <location_id> \
  --cluster <cluster_id> \
  --folder "<folder_path>" \
  --mapping "<folder_path>/mapping_YYYY-MM-DD.json"
```

Use the production database (`./db/skraak.duckdb`) unless the user specifies otherwise.

### Step 5: Report results

Show the user the import summary from the JSON output:

- Data files processed
- Segments imported
- Labels imported
- Any errors

## Error Handling

- If no `.data` files found: abort with clear message
- If cluster/location not found in DB: ask user for the correct name or IDs
- If files not yet imported (hash mismatch): tell user to run `import folder` first
- If segments already have labels: warn user - `import segments` only works on files with no existing labels
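Step 2's name extraction can be sketched as follows (a hypothetical helper for illustration; skraak itself does not ship this):

```python
from pathlib import Path

def cluster_pattern(folder_path):
    """Derive an ILIKE pattern from the last component of the folder path."""
    name = Path(folder_path.rstrip("/")).name
    return f"%{name}%"

print(cluster_pattern("/data/surveys/tx51/"))  # %tx51%
```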
---
name: dataset-report
description: Generate a report showing dataset breakdown - locations, file counts per location and per dataset - handling both structured and unstructured dataset types
---

# Dataset Report

Generate a summary report of all active datasets showing location counts, file totals, and files per location.

## When to Use

When the user asks for a dataset overview, summary, breakdown, or report.

## Query

Structured datasets link files through `location > cluster > file`. Unstructured datasets link files through the `file_dataset` junction table with no location hierarchy.

```sql
WITH structured AS (
    SELECT
        d.name AS dataset,
        d.type,
        COUNT(DISTINCT l.id) AS locations,
        COUNT(f.id) AS total_files
    FROM dataset d
    LEFT JOIN location l ON d.id = l.dataset_id AND l.active = true
    LEFT JOIN cluster c ON l.id = c.location_id AND c.active = true
    LEFT JOIN file f ON c.id = f.cluster_id AND f.active = true
    WHERE d.active = true AND d.type != 'unstructured'
    GROUP BY d.name, d.type
),
unstructured AS (
    SELECT
        d.name AS dataset,
        d.type,
        0 AS locations,
        COUNT(fd.file_id) AS total_files
    FROM dataset d
    LEFT JOIN file_dataset fd ON d.id = fd.dataset_id
    LEFT JOIN file f ON fd.file_id = f.id AND f.active = true
    WHERE d.active = true AND d.type = 'unstructured'
    GROUP BY d.name, d.type
),
combined AS (
    SELECT * FROM structured
    UNION ALL
    SELECT * FROM unstructured
)
SELECT
    dataset,
    type,
    locations,
    total_files,
    CASE WHEN locations > 0
         THEN ROUND(total_files::DECIMAL / locations, 0)
         ELSE NULL END AS files_per_location
FROM combined
ORDER BY total_files DESC
```

## Output

Present results as a markdown table with a summary line showing totals.

- Use `-` for files_per_location on unstructured datasets (no location hierarchy)
- Format large numbers with commas for readability
- Include dataset type column
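The output rules above can be sketched as a small formatter (a hypothetical helper assuming each query result row arrives as a dict; not part of skraak):

```python
def format_row(row):
    """Render one result row as a markdown table line per the output rules."""
    fpl = row["files_per_location"]
    return "| {} | {} | {} | {:,} | {} |".format(
        row["dataset"],
        row["type"],
        row["locations"],
        row["total_files"],
        "-" if fpl is None else f"{int(fpl):,}",  # '-' for unstructured datasets
    )

print(format_row({"dataset": "Kahurangi", "type": "structured",
                  "locations": 22, "total_files": 8003,
                  "files_per_location": 364}))
# | Kahurangi | structured | 22 | 8,003 | 364 |
```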
---
name: data-mapping
description: Build a mapping.json file that translates species and calltype names from .data files to DB labels, using interactive prompts
---

# Data File Species/Calltype Mapping

Build a `mapping.json` that translates `.data` file species and calltype names to database `species.label` and `call_type.label` values. Map-only: never create new species or calltypes in the DB.

## When to Use

When the user needs to map species/calltypes from `.data` files before importing selections. Typically run before `skraak import selections`.

## Workflow

### Step 1: Summarise .data files

```bash
./skraak calls summarise --folder <folder> --brief
```

Parse the JSON output. Extract unique species and per-species calltypes from the `filters` object. Each filter has:

- `species`: map of species name to count
- `calltypes`: map of species name to (calltype name to count)

Collect all unique species across all filters, and for each species collect all unique calltypes.

### Step 2: Query DB for available species and calltypes

```bash
./skraak sql --db ./db/skraak.duckdb "SELECT id, label FROM species WHERE active = true"
./skraak sql --db ./db/skraak.duckdb "SELECT ct.label, s.label as species FROM call_type ct JOIN species s ON ct.species_id = s.id WHERE ct.active = true ORDER BY s.label, ct.label"
```

Parse the JSON results. Build a lookup: DB species labels, and per-species DB calltype labels.

### Step 3: Interactive mapping

For each unique species found in `.data` files:

1. Show the user the .data species name and its count
2. Use `AskUserQuestion` with DB species labels as options (pick the most likely matches, up to 4 options + "Skip - no match")
3. If the user picks a DB species and that species has calltypes in the .data files:
   - For each .data calltype, use `AskUserQuestion` with that DB species' calltypes as options (up to 4 + "Keep as-is")
   - "Keep as-is" means the .data calltype name equals the DB calltype name (no mapping needed, omit from calltypes map)

### Step 4: Write mapping.json

Save to `<folder>/mapping_YYYY-MM-DD.json` (using the current date, e.g. `mapping_2026-03-14.json`):

```json
{
  "Don't Know": {"species": "Don't Know"},
  "GSK": {
    "species": "Haast Tokoeka",
    "calltypes": {
      "Male": "Male - Solo",
      "Female": "Female - Solo"
    }
  },
  "Morepork": {"species": "Morepork"}
}
```

Structure rules:

- Top-level keys = .data file species names
- `species` value = DB species label
- `calltypes` map = only present if calltypes exist AND at least one needs remapping
- Each calltype entry: .data calltype name -> DB calltype label
- Omit calltypes that map to themselves (user chose "Keep as-is")
- If a species maps to itself with no calltype remapping, still include it with just `{"species": "SpeciesName"}`
- If user chose "Skip", warn and omit that species entirely

### Step 5: Display summary

Print the final `mapping.json` content to the user as formatted JSON.

## Error Handling

- If a .data species has no reasonable DB match, warn and skip (don't write to mapping)
- If a DB species has no calltypes defined but .data files have calltypes for it, warn the user (the calltypes will be ignored on import)
- If `calls summarise` finds no .data files, abort with clear message
- Never create new species or calltypes in the DB

## Example Session

```
Found 4 species in .data files: GSK (342), Don't Know (28), Morepork (15), Weka (3)

Mapping species: "GSK" (342 segments)
> Which DB species? [Haast Tokoeka | Great Spotted Kiwi | Stewart Island Tokoeka | Other | Skip]
User picks: Haast Tokoeka

GSK has calltypes: Male (200), Female (120), Duet (22)
> Map calltype "Male"? [Male - Solo | Male - Duet | Keep as-is | Other]
User picks: Male - Solo
...

Mapping species: "Don't Know" (28 segments)
> Which DB species? [Don't Know | Skip]
User picks: Don't Know

Writing mapping.json...
```
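The structure rules imply a lookup like the following when the mapping is later applied (a hypothetical sketch assuming the mapping.json layout shown above; the actual import logic lives in skraak):

```python
def map_label(mapping, species, calltype=None):
    """Translate a .data (species, calltype) pair to DB labels.

    Species absent from the mapping were skipped by the user; calltypes
    absent from a species' calltypes map are kept as-is.
    """
    entry = mapping.get(species)
    if entry is None:
        return None  # user chose "Skip" - do not import
    db_species = entry["species"]
    db_calltype = entry.get("calltypes", {}).get(calltype, calltype)
    return db_species, db_calltype

mapping = {
    "GSK": {"species": "Haast Tokoeka",
            "calltypes": {"Male": "Male - Solo"}},
    "Morepork": {"species": "Morepork"},
}
print(map_label(mapping, "GSK", "Male"))  # ('Haast Tokoeka', 'Male - Solo')
print(map_label(mapping, "GSK", "Duet"))  # ('Haast Tokoeka', 'Duet') - kept as-is
print(map_label(mapping, "Weka"))         # None - skipped
```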
---name: bulk-import-setupdescription: Set up bulk import for acoustic monitoring datasets - fuzzy match existing location NAMES, create new ones with GPS parsing if log.txt present in folder, and generate import CSV with comprehensive logging---# Bulk Import Setup for Acoustic Monitoring DatasetsThis skill helps you prepare acoustic monitoring datasets for bulk import by analyzing locations, matching existing records, creating new locations with optional GPS parsing, and generating import CSV files.## When to Use This SkillUse this skill when the user needs to:- Import a batch of acoustic recordings organized in location folders- Create multiple location records for a dataset- Match folder names to existing database locations (fuzzy matching)- Parse GPS coordinates from log files- Generate CSV files for bulk file import## Workflow Overview1. **Analyze** directory structure and existing locations2. **Fuzzy match** folder names to database locations (NAME ONLY - never GPS!)3. **Parse GPS log.txt files** from folders4. **Prepare CSV** with date-range cluster names and file counts5. **Check with user** before creating any new locations6. **Create locations** (if confirmed) with GPS or fallback coordinates7. **Update existing locations** with GPS data (if they have assumed coords)8. **Update CSV** with final location IDs9. **Verify** and provide bulk import command**NEVER match on GPS coordinates!** Recorders can be physically close but represent different locations.```python# ❌ WRONG - Don't compare GPS distancesif abs(loc1_lat - loc2_lat) < 0.001: # NO!return "match"# ✅ CORRECT - Only compare normalized namesnormalized_name1 = normalize_location_name(folder_name)normalized_name2 = normalize_location_name(existing_name)score = SequenceMatcher(None, normalized_name1, normalized_name2).ratio()```### 2. 
GPS Log Files are INSIDE Folders**Parse log files by folder structure**, not by separate log file list.```python# ✅ CORRECT - Log file is inside the folderfolder_path = "/path/to/mok_bl76/"log_path = folder_path / "*log.txt" # Inside the folder!# Extract folder name from path for mappingfolder_name = Path(folder_path).name # "mok_bl76"gps_data[folder_name] = parse_gps_log(log_path)```### 3. Update Matched Locations with GPS DataIf fuzzy matching finds an existing location with "assumed" coordinates, **update it** with GPS data if GPS data is available.```pythonif matched_location:# Check if location has assumed Takaka coordinatesif matched_location['description'].contains("assumed") or \matched_location['description'].contains("Takaka") or \matched_location['description'].contains("pending GPS"):# We have GPS data - UPDATE the location!if gps_data:update_location(matched_location['id'], gps_data['lat'], gps_data['lon'])```### 4. Cluster Names = Date Ranges (NOT Location Names!)**Cluster names represent time periods**, not locations.```python# ❌ WRONG - Using location namecluster_name = "mok_bl76"# ✅ CORRECT - Using date range from filesfirst_file = "mok_bl76_20250422_173008.wav"last_file = "mok_bl76_20250503_064508.wav"cluster_name = "2025-04-22 to 2025-05-03"```### 5. 
CSV Must Have 6 Columns (Including file_count)The bulk import expects **exactly 6 columns**:```csvlocation_name,location_id,directory_path,date_range,sample_rate,file_countmok_bl11,HA4TDv3DlhjX,/media/david/Misc-2/Manu o Kahurangi kiwi survey (3)/Data/K3_mokbl_boulder lake track/mok_bl11,2024-12,8000,68```**Missing file_count will cause**: `Error: CSV row 2 has insufficient columns (expected 6, got 5)`## Step 1: Analyze Directory StructureFirst, understand the organization:```python# Key questions to answer:- How many location folders?- Naming patterns (dash/underscore)?- GPS log files present?- File counts per location?- Date formats in filenames?```**Check for:**- Log files: `LOCATION-log.txt` or `log.txt`- GPS data format: `GPS (lat,long): -40.123456,172.123456`- Filename date patterns: `YYYYMMDD` or `DDMMYY` or `YYMMDD`## Step 2: Fuzzy Match Existing Locations**IMPORTANT**: Always query existing locations first to avoid duplicates!### Fuzzy Matching Rules**CRITICAL: Use strict threshold (0.90) to avoid false matches!**```pythonfrom difflib import SequenceMatcherdef normalize_location_name(name: str) -> str:"""Normalize location name for fuzzy matching."""# Remove common suffixes/prefixesname = name.lower()name = re.sub(r'[_\-\s]+', '', name) # Remove underscores, hyphens, spacesname = re.sub(r'(low|high)$', '', name) # Remove Low/High suffixname = re.sub(r'^mok', 'mok', name) # Normalize mok prefixreturn namedef fuzzy_match_score(s1: str, s2: str) -> float:"""Calculate fuzzy match score between two strings."""return SequenceMatcher(None, normalize_location_name(s1), normalize_location_name(s2)).ratio()def fuzzy_match_location(folder_name: str, existing_locations: List[Dict]) -> Optional[Dict]:"""Fuzzy match folder name to existing location BY NAME ONLY.IMPORTANT:- Returns match ONLY if similarity > 0.90- NEVER compares GPS coordinates (locations can be close but different!)This threshold prevents false matches like:- "mok_bl6P" matching to "mok_bl16" (score: 
0.86) ✗ Different locations!- "mok_tuna51" matching to "mok_tuna31" (score: 0.89) ✗ Different locations!But allows:- "mok_bl6P_Low" matching to "mok_bl6P" (score: 1.00) ✓ Same location"""normalized_folder = normalize_location_name(folder_name)best_match = Nonebest_score = 0.0for loc in existing_locations:normalized_loc = normalize_location_name(loc['name'])score = SequenceMatcher(None, normalized_folder, normalized_loc).ratio()if score > best_score:best_score = scorebest_match = loc# Strict threshold - only match if very similarif best_score > 0.90:print(f" Matched '{folder_name}' -> '{best_match['name']}' (score: {best_score:.3f})")# Check if matched location needs GPS updateif 'description' in best_match:desc = best_match['description'].lower()if 'assumed' in desc or 'takaka' in desc or 'pending gps' in desc:print(f" ⚠️ Location has assumed coords - check for GPS data to update")return best_matchreturn None```**Why 0.90 threshold?**- Too low (0.80-0.85): False matches → wrong location assignments- 0.90: Only matches very similar names (case/punctuation differences)- Example: `mok_bl6P_Low` (score: 1.00) ✓ vs `mok_bl16` (score: 0.86) ✗### Query Existing Locations```pythoncmd = [SKRAAK_BIN, "sql","--db", db_path,f"SELECT name, id FROM location WHERE dataset_id = '{dataset_id}' AND active = true","--limit", "500"]```**Note**: If production DB is locked, ask to use use backup database for querying, but note that locations will be created in production later.**Note**: Don't forget to check some file sample rates using skraak metadata command, or sox.## Step 3: Prepare CSV with Analysis**Create preparation script** that:1. Matches folders to existing locations2. Identifies which locations need to be created3. Analyzes file metadata (count, sample rate, date range)4. Generates CSV with placeholder IDs for new locations5. 
### CSV Format

**CRITICAL: Must have exactly 6 columns, with a date range (never a location name) in the date_range column!**

```csv
location_name,location_id,directory_path,date_range,sample_rate,file_count
mok_bl11,HA4TDv3DlhjX,/media/david/Misc-2/Manu o Kahurangi kiwi survey (3)/Data/K3_mokbl_boulder lake track/mok_bl11,2024-12,8000,68
```

**Key points:**

- `location_name`: the folder/location name
- `location_id`: actual ID or a `<CREATE:name>` placeholder
- `directory_path`: full absolute path
- `date_range`: a **date range** (e.g., "2025-04-22 to 2025-05-03") - NEVER a location name!
- `sample_rate`: integer (e.g., 8000, 32000)
- `file_count`: integer count of WAV files in the folder

(The dataset ID, a 12-character nanoid, is passed on the command line via `--dataset`, not in the CSV.)

### Log File Format

Create a detailed log showing:

- Each location processed
- Match status (existing or new)
- File counts and metadata
- Summary statistics
- Next steps

```python
LOG_FILE = BASE_DIR / "import_log.txt"

with open(LOG_FILE, 'w') as log_file:
    log(f"Processing: {folder_name}", log_file)
    if matched:
        log(f"  ✓ MATCH: {folder} → {existing} (ID: {id})", log_file)
    else:
        log(f"  + NEW: {folder} (will create)", log_file)
    log(f"  Files: {count}, Sample Rate: {rate} Hz, Date: {date}", log_file)
```

## Step 4: Check with User Before Creating Locations

**CRITICAL**: Never create locations without user confirmation!

Present a summary and ask:

```
PLAN READY - AWAITING CONFIRMATION
===================================
Matched existing: 17
Need to create: 5
Total files: 8,003

Locations to be created:
- new-location-1 (420 files)
- new-location-2 (350 files)
...

Shall I proceed to create these 5 new locations?
```

**Wait for user approval** before executing any location creation!

## Step 5: Create Locations (If Confirmed)

### GPS Parsing (If Available)

```python
def parse_gps_from_log(log_file: Path) -> Tuple[float, float, int]:
    """Parse and average GPS coordinates from a log file.

    IMPORTANT: Use errors='ignore' to handle encoding issues!
    Some log files contain non-UTF-8 bytes that will crash the read without it.
    """
    if not log_file.exists():
        return None, None, 0

    gps_pattern = re.compile(r'GPS \(lat,long\): (-?\d+\.\d+),(-?\d+\.\d+)')
    latitudes = []
    longitudes = []

    # CRITICAL: errors='ignore' prevents UnicodeDecodeError
    with open(log_file, 'r', encoding='utf-8', errors='ignore') as f:
        for line in f:
            match = gps_pattern.search(line)
            if match:
                latitudes.append(float(match.group(1)))
                longitudes.append(float(match.group(2)))

    if latitudes:
        avg_lat = sum(latitudes) / len(latitudes)
        avg_lon = sum(longitudes) / len(longitudes)
        return (avg_lat, avg_lon, len(latitudes))
    return None, None, 0
```

### Location Creation

```python
# Use the CLI tool (not MCP) for scripting
cmd = [
    "./skraak", "create", "location",
    "--db", db_path,
    "--dataset", dataset_id,
    "--name", location_name,
    "--lat", str(latitude),
    "--lon", str(longitude),
    "--timezone", "Pacific/Auckland",  # NZ default
    "--description", description,
]
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
response = json.loads(result.stdout)
location_id = response["location"]["id"]
```

### Description Format

**With GPS:**

```python
description = f"Averaged from {fix_count} GPS readings"
# Example: "Averaged from 14 GPS readings"
```

**Without GPS (fall back to Takaka for Kahurangi data, Te Anau for Fiordland data):**

```python
description = "Location assumed to be Takaka pending GPS data"
# Note: Takaka coords are -40.85085, 172.80703
```

**Generic (user preference - if provided):**

```python
description = "Friends of Cobb - Assumed location (Takaka) - no GPS data available"
```

## Step 6: File Metadata Analysis

### Count WAV Files (RECURSIVE!)

**CRITICAL: Always count recursively!** Files may be organized in daily subfolders.

```python
def count_wav_files(folder_path: str) -> int:
    """Count WAV files recursively (searches all subdirectories).

    IMPORTANT: Don't use glob("*.wav") - it only searches the top level!
    Files are often organized in daily subfolders:
        location/
            2025-03-08/
                file1.wav
                file2.wav
            2025-03-09/
                file3.wav
    """
    count = 0
    for root, dirs, files in os.walk(folder_path):
        for file in files:
            if file.lower().endswith('.wav'):
                count += 1
    return count
```

### Extract Sample Rate (RECURSIVE!)

```python
def get_sample_rate(folder_path: str) -> int:
    """Get the sample rate from the first WAV file (searches recursively).

    IMPORTANT: Search recursively since files may be in subfolders!
    """
    for root, dirs, files in os.walk(folder_path):
        for file in files:
            if file.lower().endswith('.wav'):
                file_path = os.path.join(root, file)
                # Read the WAV header (sample rate at bytes 24-27)
                try:
                    with open(file_path, 'rb') as f:
                        f.seek(24)
                        return int.from_bytes(f.read(4), byteorder='little')
                except OSError:
                    pass
    return 8000  # Default fallback
```

### Parse Date Range (RECURSIVE!)

**Support both formats:**

```python
def extract_date_from_filename(filename: str) -> Optional[str]:
    """Extract the date from a filename like mok_bl6P_20250308_200003.wav"""
    match = re.search(r'(\d{8})_', filename)
    if match:
        date_str = match.group(1)
        # Convert YYYYMMDD to YYYY-MM
        return f"{date_str[:4]}-{date_str[4:6]}"
    return None

def determine_date_range(folder_path: str) -> str:
    """Determine the date range from WAV filenames (searches recursively).

    IMPORTANT: Search recursively since files may be in daily subfolders!
    """
    dates = set()
    for root, dirs, files in os.walk(folder_path):
        for file in files:
            if file.lower().endswith('.wav'):
                date = extract_date_from_filename(file)
                if date:
                    dates.add(date)
    if dates:
        sorted_dates = sorted(dates)
        if len(sorted_dates) == 1:
            return sorted_dates[0]
        return f"{sorted_dates[0]} to {sorted_dates[-1]}"
    return "unknown"
```

## Step 7: Update CSV and Verify

After creating locations, update the CSV with real IDs:

```python
# Update TO_BE_CREATED → real location IDs
# Regenerate the CSV with all real IDs
```

### Verification Queries

```sql
-- Check all locations created
SELECT id, name, latitude, longitude, SUBSTRING(description, 1, 50)
FROM location
WHERE dataset_id = 'DATASET_ID'
  AND name LIKE 'PREFIX%'
  AND active = true
ORDER BY name;

-- Verify the row count matches expected
```

### CSV Verification

```bash
# Should be N+1 rows (header + N locations)
wc -l import.csv

# Sum file counts
cat import.csv | tail -n +2 | cut -d',' -f6 | paste -sd+ | bc
```

## Database Safety

### Use Backup for Queries if Production Locked

```python
# Query the backup database for existing locations
# (check for the latest backup; expect something like
#  "/home/david/go/src/skraak/db/backup_YYYY-MM-DD_X.duckdb")
QUERY_DB = "/home/david/go/src/skraak/db/backup_YYYY-MM-DD_X.duckdb"

# Create locations in production (when unlocked)
PROD_DB = "/home/david/go/src/skraak/db/skraak.duckdb"
```

### Always Default to Test Database

⚠️ **Unless the user explicitly specifies production:**

```python
DB_PATH = "/home/david/go/src/skraak/db/test.duckdb"  # Default
```

## Example Usage

### Example 1: All Locations Already Exist

**User Request:**

```
Setup bulk import for Friends of Cobb dataset:
- Dataset ID: QZ0tlUrX4Nyi
- 17 location folders
- Use Takaka coordinates (-40.85085, 172.80703)
- Generate CSV at: /path/to/foc_import.csv
```

**Your Response:**

1. Query existing locations (17 found)
2. Fuzzy match all 17 folders → all matched!
3. Generate CSV with real location IDs
4. Report: "All locations exist, CSV ready for import"
5. Provide the bulk import command

### Example 2: Create New Locations (MOK Import)

**User Request:**

```
Import Manu o Kahurangi kiwi survey:
- 23 location folders
- 8 have GPS log files
- 15 need Takaka default coords
- Generate CSV at: /path/to/mok_import.csv
```

**Your Workflow:**

1. **Parse GPS logs** (8 locations):

```python
GPS_DATA = {
    "mok_bl6P_Low": {"lat": -40.823070, "lon": 172.582703, "fixes": 14},
    "mok_bl6T_Low": {"lat": -40.797182, "lon": 172.594582, "fixes": 14},
    ...
}
```

2. **Query existing locations** with strict fuzzy matching (0.90 threshold):
   - Result: no matches found (all new locations)
3. **Count files recursively** (files in daily subfolders):
   - Total: 15,212 WAV files across 23 locations
4. **Present the plan to the user**:

```
PLAN: Create 23 new locations
- 8 with GPS coordinates (averaged from log files)
- 15 with Takaka defaults
Total files: 15,212
Proceed? [yes/no]
```
5. **After approval, create all locations**:

```python
for location in locations:
    if location in GPS_DATA:
        gps = GPS_DATA[location]
        description = f"Averaged from {gps['fixes']} GPS readings"
    else:
        gps = {'lat': DEFAULT_LAT, 'lon': DEFAULT_LON}
        description = "Location assumed to be Takaka pending GPS data"
    location_id = create_location(location, gps['lat'], gps['lon'], description)
```

6. **Generate the CSV** with real location IDs
7. **Verify**:
   - Query the database to confirm all 23 locations were created
   - Verify the CSV has 23 rows + header
   - Provide the bulk import command

## Output Summary Template

```
======================================================================
BULK IMPORT SETUP COMPLETE
======================================================================
Dataset: Dataset Name (ID: abc123)
Database: /path/to/db

Locations processed: 17/17
- Existing (matched): 17
- New (created): 0
- Failed: 0

Total files: 8,003

CSV Generated: /path/to/import.csv
Log file: /path/to/import_log.txt

Ready for bulk import!
Run:
cd /home/david/go/src/skraak
./skraak import bulk --db ./db/skraak.duckdb \
  --dataset DATASET_ID \
  --csv "/path/to/import.csv" \
  --log /path/to/bulk_import.log
```

## All-in-One Python Script Template

For complex imports with many locations, create a comprehensive Python script:

```python
#!/usr/bin/env python3
"""Bulk import setup: Parse GPS, create locations, generate CSV"""
import os
import re
import json
import subprocess
from pathlib import Path
from difflib import SequenceMatcher
from typing import Dict, List, Optional, Tuple

# Configuration
DATASET_ID = "abc123xyz789"
DB_PATH = "./db/skraak.duckdb"
TIMEZONE = "Pacific/Auckland"
BASE_PATH = "/path/to/data"
DEFAULT_LAT = -40.85085
DEFAULT_LON = 172.80703

# GPS log files (if available)
GPS_LOGS = ["location1/log.txt", "location2/log.txt"]

# Location folders
FOLDERS = ["location1", "location2", ...]

def parse_gps_log(log_path: str) -> Optional[Tuple[float, float, int]]:
    """Parse a GPS log and return (lat, lon, fix_count)"""
    if not os.path.exists(log_path):
        return None
    gps_pattern = re.compile(r'GPS \(lat,long\): (-?\d+\.\d+),(-?\d+\.\d+)')
    coords = []
    with open(log_path, 'r', encoding='utf-8', errors='ignore') as f:
        for line in f:
            match = gps_pattern.search(line)
            if match:
                coords.append((float(match.group(1)), float(match.group(2))))
    if coords:
        avg_lat = sum(c[0] for c in coords) / len(coords)
        avg_lon = sum(c[1] for c in coords) / len(coords)
        return (avg_lat, avg_lon, len(coords))
    return None

def count_wav_files_recursive(folder_path: str) -> int:
    """Count WAV files recursively"""
    count = 0
    for root, dirs, files in os.walk(folder_path):
        count += sum(1 for f in files if f.lower().endswith('.wav'))
    return count

def get_sample_rate_recursive(folder_path: str) -> int:
    """Get the sample rate from the first WAV file found"""
    for root, dirs, files in os.walk(folder_path):
        for file in files:
            if file.lower().endswith('.wav'):
                try:
                    with open(os.path.join(root, file), 'rb') as f:
                        f.seek(24)
                        return int.from_bytes(f.read(4), byteorder='little')
                except OSError:
                    pass
    return 8000

def create_location(name: str, lat: float, lon: float, desc: str) -> str:
    """Create a location via the CLI and return its ID"""
    cmd = [
        "./skraak", "create", "location",
        "--db", DB_PATH,
        "--dataset", DATASET_ID,
        "--name", name,
        "--lat", str(lat),
        "--lon", str(lon),
        "--timezone", TIMEZONE,
        "--description", desc,
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)['location']['id']

def main():
    # Parse GPS logs
    gps_data = {}
    for log_file in GPS_LOGS:
        result = parse_gps_log(os.path.join(BASE_PATH, log_file))
        if result:
            folder = log_file.split('/')[0]
            gps_data[folder] = {'lat': result[0], 'lon': result[1], 'fixes': result[2]}

    # Process locations
    location_map = {}
    for folder in FOLDERS:
        folder_path = os.path.join(BASE_PATH, folder)

        # Determine coordinates and description
        if folder in gps_data:
            lat = gps_data[folder]['lat']
            lon = gps_data[folder]['lon']
            desc = f"Averaged from {gps_data[folder]['fixes']} GPS readings"
        else:
            lat = DEFAULT_LAT
            lon = DEFAULT_LON
            desc = "Location assumed to be Takaka pending GPS data"

        # Create location
        location_id = create_location(folder, lat, lon, desc)

        # Get metadata
        file_count = count_wav_files_recursive(folder_path)
        sample_rate = get_sample_rate_recursive(folder_path)
        location_map[folder] = {
            'id': location_id,
            'files': file_count,
            'sample_rate': sample_rate,
        }

    # Generate CSV (use manual writes, not the csv module, to avoid \r\n line endings)
    csv_path = os.path.join(BASE_PATH, "import.csv")
    with open(csv_path, 'w') as f:
        f.write("location_name,location_id,directory_path,date_range,sample_rate,file_count\n")
        for folder, info in location_map.items():
            f.write(f"{folder},{info['id']},{BASE_PATH}/{folder},2025-02,{info['sample_rate']},{info['files']}\n")

    print(f"✓ Created {len(location_map)} locations")
    print(f"✓ Generated CSV: {csv_path}")

if __name__ == '__main__':
    main()
```

## Common Patterns

### Date Format Detection

```python
# YYYYMMDD: 20231115_050004
# DDMMYY:   011123_050004
# YYMMDD:   231101_050004

# Heuristic: if the first 4 chars parse to > 2000, it is likely YYYYMMDD.
# Variance can disambiguate YYMMDD vs DDMMYY (many distinct days, few distinct years).
```

### Multiple Sample Rates

```python
# Don't assume uniform rates - check each location
# Common: 8000 Hz, 16000 Hz, 32000 Hz, 250000 Hz
```

### Trip References in Names

```python
# Location names may include trip references:
# foc-thup-foc01 → Friends of Cobb, trip 1
# foc-thup-fof11 → Friends of Flora, trip 11
```

## Troubleshooting

### Production Database Locked

**Solution**: Use the backup database for queries, create locations later:

```python
# Step 1: Query the backup for matches
# Step 2: Prepare CSV with analysis
# Step 3: Check with the user
# Step 4: Wait for the prod unlock
# Step 5: Check prod agrees with the backup
# Step 6: Create locations in prod
# Step 7: Update CSV with real IDs
```

### Ambiguous Fuzzy Matches

**Problem**: One database location matches multiple folders

**Solution**: Be conservative - create a new location for each folder

```python
# foc-thup-f could match:
# - foc-thup-foc01
# - foc-thup-fof11
# Create a new location for each folder instead
```

### Zero-Byte Files

**Handling**: Count them
but note them in the logs:

```python
wav_files = list(directory.rglob("*.wav"))  # recursive, per the lesson below
file_count = len(wav_files)
zero_byte_files = [f for f in wav_files if f.stat().st_size == 0]
if zero_byte_files:
    log(f"  ⚠ {len(zero_byte_files)} zero-byte files found", log_file)
```

## Key Lessons (From Production Imports)

### 🔴 CRITICAL: Stricter Fuzzy Matching (0.90 threshold)

**Problem:** A threshold of 0.80 caused false matches:

- `mok_bl6P` matched to `mok_bl16` (score: 0.86) → WRONG! Different locations
- `mok_tuna51` matched to `mok_tuna31` (score: 0.89) → WRONG! Different locations

**Solution:** Use a 0.90 threshold:

```python
if best_score > 0.90:  # NOT 0.80!
    return best_match
```

### 🔴 CRITICAL: Always Search Recursively

**Problem:** Using `glob("*.wav")` missed files in subdirectories.

**Solution:** Use `os.walk()` everywhere:

```python
for root, dirs, files in os.walk(folder_path):  # NOT glob()!
    for file in files:
        if file.lower().endswith('.wav'):
            pass  # process file here
```

**Applies to:**

- File counting
- Sample rate detection
- Date range parsing

### 🔴 CRITICAL: Handle Encoding Errors

**Problem:** Some GPS log files contain non-UTF-8 bytes → `UnicodeDecodeError`

**Solution:** Always use `errors='ignore'`:

```python
with open(log_file, 'r', encoding='utf-8', errors='ignore') as f:  # errors='ignore'!
    ...
```

### 🟡 IMPORTANT: Remove Suffixes in Normalization

Normalize location names to handle suffix variations:

```python
name = name.lower()
name = re.sub(r'[_\-\s]+', '', name)     # strip separators first
name = re.sub(r'(low|high)$', '', name)  # then remove Low/High suffix
# mok_bl6P_Low → mokbl6p (for matching)
```

### 🟡 IMPORTANT: All-in-One Python Script

For complex imports (10+ locations), create a single Python script that:

1. Parses GPS logs
2. Counts files recursively
3. Gets metadata
4. Creates locations via the CLI
5. Generates the CSV

**Benefits:**

- Single entry point (everything in one script)
- Easy to debug
- Reusable for similar imports
- Complete audit trail

## Guidelines

- Always check existing locations before creating new ones (avoid duplicates)
- Use **strict** fuzzy matching (0.90 threshold) to avoid false matches
- **Always search recursively** with `os.walk()` (files may be in subfolders!)
- Prepare the CSV before creating locations (allows user review)
- Never create locations without explicit user confirmation
- Write comprehensive logs for debugging and records
- Support multiple date formats (YYYYMMDD, DDMMYY, and YYMMDD)
- Handle missing GPS gracefully with fallback coordinates
- **Handle encoding errors** with `errors='ignore'` when reading logs
- Verify results with SQL queries and CSV checks
- Provide a ready-to-run bulk import command

## Success Criteria

✅ No duplicate locations created
✅ All folders matched or new locations planned
✅ CSV has the correct format with all required columns
✅ Log file provides an audit trail
✅ User confirmed before any database writes
✅ Verification queries show expected results
✅ Bulk import command provided
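The variance heuristic mentioned under Common Patterns (many distinct days, few distinct years) can be sketched as follows. This is an illustrative helper, not part of the skraak CLI; `guess_date_format` and its return strings are assumptions for the sketch:

```python
import re
from typing import List, Optional

def guess_date_format(stems: List[str]) -> Optional[str]:
    """Guess the date convention used in a batch of recorder filenames.

    Heuristic (from 'Common Patterns'): an 8-digit token whose first 4 digits
    exceed 2000 is YYYYMMDD; for 6-digit tokens, the 2-digit field with more
    distinct values across many files is the day (deployments span many days,
    but rarely more than one year).
    """
    tokens = [m.group(1) for s in stems if (m := re.search(r'(\d{6,8})_', s))]
    eight = [t for t in tokens if len(t) == 8]
    six = [t for t in tokens if len(t) == 6]
    if eight and all(int(t[:4]) > 2000 for t in eight):
        return "YYYYMMDD"
    if six:
        first = {t[:2] for t in six}  # distinct values of the leading field
        last = {t[4:] for t in six}   # distinct values of the trailing field
        return "DDMMYY" if len(first) > len(last) else "YYMMDD"
    return None
```

With only one or two files the day/year fields may tie, so treat the guess as a prompt for manual confirmation rather than a final answer.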
---

Note skill: skraak-bulk-import-setup

# Bulk Import

**Dataset:** [name]
**Base folder:** [path]
**WAV files:** [count]
**Folders:** [count]
**Ignore:** Clips_*
**Recursive:** yes/no

## Folders

[list]

## Log files (GPS)

[list, or "none - assume Takaka -40.85085, 172.80703"]

Note if multiple logs per folder (average coords).

## Output paths

- CSV: [path]
- Log: [path]
- Python: save to base folder

## Import command

```bash
./skraak import bulk --db ./db/skraak.duckdb --dataset [id] --csv "[csv]" --log "[log]"
```