# Claude's Notes - Skraak MCP Server

This file contains important reminders and best practices for working with the Skraak MCP Server codebase.

## 🚨 CRITICAL DATABASE SAFETY WARNING 🚨

### ALWAYS USE TEST DATABASE FOR TESTING

**⚠️ EXTREMELY IMPORTANT**: When testing shell scripts or any end-to-end functionality, **ALWAYS** use the test database, **NEVER** the production database!

**CORRECT - Use test database:**
```bash
cd shell_scripts
./test_sql.sh ../db/test.duckdb > test.txt 2>&1
./test_resources_prompts.sh ../db/test.duckdb | jq '.'
```

**WRONG - DO NOT USE production database:**
```bash
# ❌ NEVER DO THIS - WILL CORRUPT PRODUCTION DATABASE
./test_sql.sh ../db/skraak.duckdb
./test_sql.sh  # Uses skraak.duckdb by default - DANGEROUS!
```

**Why this matters:**
- `db/skraak.duckdb` is the **PRODUCTION** database with real data
- `db/test.duckdb` is the **TEST** database for safe testing
- Even though the database is read-only, repeated connections during testing can cause lock issues
- DuckDB may create temporary files (.wal, .tmp) that can interfere with production access
- Test scripts make many rapid connections that can stress the database

**Default behavior:**
- All shell scripts default to `../db/skraak.duckdb` if no argument is provided
- **YOU MUST EXPLICITLY SPECIFY** `../db/test.duckdb` when testing
- Better yet: Always pipe to files to avoid accidents

**Safe testing workflow:**
```bash
cd shell_scripts

# Always specify test database explicitly
./test_sql.sh ../db/test.duckdb > test.txt 2>&1
./test_resources_prompts.sh ../db/test.duckdb > test_resources.txt 2>&1
./test_all_prompts.sh ../db/test.duckdb > test_prompts.txt 2>&1

# Then analyze results
rg '"result":' test.txt | wc -l
rg '"isError":true' test.txt | wc -l
```

## ⚠️ CRITICAL TESTING REMINDER

### Running Test Scripts with Large Output

**NEVER** run test scripts directly without piping to a file. Large outputs can cause token overflow.

**CORRECT APPROACH:**
```bash
# All shell scripts are in shell_scripts/ directory
cd shell_scripts

# ALWAYS use test database and pipe output to file
./test_sql.sh ../db/test.duckdb > test.txt 2>&1

# Then use targeted searches to verify results
rg -i "error" test.txt                    # Check for errors
rg '"result":' test.txt | wc -l           # Count successful responses
rg '"isError":true' test.txt | wc -l      # Count validation errors (expected)
```

**WRONG APPROACH:**
```bash
# ❌ DON'T DO THIS - may crash with massive output
cd shell_scripts && ./test_sql.sh
```

### Available Test Scripts

**IMPORTANT: All shell scripts are located in the `shell_scripts/` directory** to keep the project organized.

All test scripts accept an optional database path argument and **default to `../db/test.duckdb`** for safety!
- Default: `../db/test.duckdb` ✅ **SAFE - use for testing**
- Production: `../db/skraak.duckdb` ⚠️ **Only use in production**

**Core functionality:**
1. **get_time.sh** - Quick test of get_current_time tool (no database needed)
2. **test_sql.sh [db_path]** - Tests execute_sql tool with various queries
   - Tests: simple SELECT, parameterized queries, JOINs, aggregates, security validation
   - Always pipe to file and use test database!

**Write tools (create/update):**
3. **test_tools.sh [db_path]** - Comprehensive test of all 4 create_or_update tools
   - Tests: create_or_update_dataset, create_or_update_location, create_or_update_cluster, create_or_update_pattern
   - Tests both create mode (no id) and update mode (with id)
   - Tests both valid inputs (should succeed) and invalid inputs (should fail)

**Import tools:**
4. **test_import_file.sh [db_path]** - Tests import_file tool (single file import)
5. **test_import_selections.sh [db_path]** - Tests import_ml_selections tool setup
6. **test_bulk_import.sh [db_path]** - Tests bulk_file_import tool (CSV-based bulk import)

**Resources and prompts:**
7. **test_resources_prompts.sh [db_path]** - Tests resources and prompts
8. **test_all_prompts.sh [db_path]** - Tests all 6 prompts

### Verifying Test Success

After piping to test.txt, check for:
```bash
# Count successful responses (should equal number of successful tests)
rg '"result":' test.txt | wc -l

# Count validation errors (expected for security tests)
rg '"isError":true' test.txt | wc -l

# No unexpected errors
rg -i '"error"' test.txt | grep -v '"isError"'
```

## Project Overview

### Architecture

The Skraak MCP Server is a Model Context Protocol (MCP) server written in Go that provides a **generic SQL query interface** for an acoustic monitoring system. It follows MCP's three-primitive architecture with an LLM-friendly design:

- **Tools** (model-controlled): Generic SQL query execution + time utility
- **Resources** (application-driven): Full database schema for context
- **Prompts** (user-controlled): SQL workflow templates that teach query patterns

### Philosophy: Schema + Generic SQL > Specialized Tools

**Why Generic SQL:**
- LLMs can construct any query given the schema (infinite flexibility)
- No rigid tool APIs to learn (just SQL)
- Full SQL expressiveness: JOINs, aggregates, CTEs, subqueries
- Prompts teach SQL patterns instead of tool calling

**Previous specialized tools were limiting:**
- Each tool = one fixed query
- Couldn't filter beyond hardcoded parameters
- Couldn't JOIN tables or use aggregates
- Created artificial boundaries

**With schema + generic SQL:**
- Infinite query possibilities
- LLM constructs appropriate query for each question
- Full DuckDB SQL feature set available
- More aligned with MCP philosophy (provide context, not APIs)

### Directory Structure

```
skraak_mcp/
├── main.go                       # Server entry point, tool registration
├── db/
│   ├── db.go                     # Database connection (read-only mode)
│   ├── types.go                  # Type definitions
│   ├── schema.sql                # Database schema (348 lines)
│   ├── skraak.duckdb             # Production database ⚠️
│   └── test.duckdb               # Test database (use for testing) ✅
├── tools/                        # Tool implementations
│   ├── time.go                   # get_current_time (utility tool)
│   ├── sql.go                    # execute_sql (generic query tool)
│   ├── import_files.go           # import_audio_files (batch WAV import)
│   ├── import_file.go            # import_file (single WAV file import)
│   ├── import_ml_selections.go   # import_ml_selections (ML detection import)
│   ├── bulk_file_import.go       # bulk_file_import (CSV-based bulk import)
│   ├── dataset.go                # create_or_update_dataset
│   ├── location.go               # create_or_update_location
│   ├── cluster.go                # create_or_update_cluster
│   └── pattern.go                # create_or_update_pattern
├── resources/
│   └── schema.go                 # Schema resources (full & per-table)
├── prompts/
│   └── examples.go               # SQL workflow templates (6 prompts)
├── utils/                        # Utility functions for file imports
│   ├── astronomical.go           # Solar/civil night, moon phase calculations
│   ├── astronomical_test.go      # Tests (11 test cases)
│   ├── audiomoth_parser.go       # AudioMoth WAV comment parsing
│   ├── audiomoth_parser_test.go  # Tests (36 test cases)
│   ├── filename_parser.go        # Filename timestamp parsing + timezone
│   ├── filename_parser_test.go   # Tests (60 test cases)
│   ├── selection_parser.go       # ML selection filename/folder parsing
│   ├── selection_parser_test.go  # Tests (34 test cases)
│   ├── wav_metadata.go           # WAV file header parsing
│   ├── wav_metadata_test.go      # Tests (22 test cases)
│   ├── xxh64.go                  # XXH64 hash computation
│   └── xxh64_test.go             # Tests (6 test cases)
└── shell_scripts/                # Shell test scripts (end-to-end)
    ├── get_time.sh               # Time tool test (no database)
    ├── test_sql.sh               # SQL tool tests
    ├── test_tools.sh             # All write tools tests (create/update)
    ├── test_import_file.sh       # Single file import tests
    ├── test_import_selections.sh # ML selection import tests
    ├── test_bulk_import.sh       # Bulk file import tests
    ├── test_resources_prompts.sh # Resources/prompts tests
    ├── test_all_prompts.sh       # All 6 prompts tests
    └── TESTING.md                # Comprehensive testing documentation
```

## Available Tools

### Time Tool
- `get_current_time` - Returns current system time with timezone and Unix timestamp
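
A minimal sketch of the kind of payload this tool produces (field names here are illustrative, not necessarily those used in `tools/time.go`):

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

func main() {
	now := time.Now()
	out, _ := json.Marshal(map[string]any{
		"time":     now.Format(time.RFC3339), // local time with UTC offset
		"timezone": now.Location().String(),  // e.g. "Local" or an IANA zone name
		"unix":     now.Unix(),               // Unix timestamp in seconds
	})
	fmt.Println(string(out))
}
```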

### Generic SQL Query Tool
- `execute_sql` - Execute arbitrary SQL SELECT queries
  - **Supports**: SELECT, WITH (CTEs), parameterized queries (? placeholders)
  - **Security**: Database is read-only (enforced by DuckDB), forbidden keyword validation
  - **Limits**: Default 1000 rows (max 10000) to prevent overwhelming responses
  - **Output**: Generic JSON results with column metadata
  - **Use with**: Schema resources to construct any query you need

### Import Tools
- `import_audio_files` - Batch import WAV files from folder
  - Automatically parses AudioMoth and filename timestamps
  - Calculates XXH64 hashes, extracts metadata
  - Computes astronomical data (solar/civil night, moon phase)
  - Skips duplicates (by hash), imports in single transaction

- `import_file` - Import a single WAV file into the database
  - **Input**: Absolute path to WAV file, dataset/location/cluster IDs
  - **Processing**: Same as batch import (AudioMoth/filename timestamps, hash, metadata, astronomical data)
  - **Output**: Detailed file metadata including file_id, hash, duration, sample_rate, timestamps
  - **Duplicate detection**: Returns `is_duplicate=true` if file hash already exists
  - **Use case**: Import individual files without scanning folders
  - **Example**:
    ```json
    {
      "name": "import_file",
      "arguments": {
        "file_path": "/path/to/recording.wav",
        "dataset_id": "abc123xyz789",
        "location_id": "def456uvw012",
        "cluster_id": "ghi789rst345"
      }
    }
    ```
  - **Output**:
    ```json
    {
      "file_id": "nB3xK8pLm9qR5sT7uV2wX",
      "file_name": "recording.wav",
      "hash": "a1b2c3d4e5f6g7h8",
      "duration_seconds": 60.0,
      "sample_rate": 250000,
      "timestamp_local": "2024-01-15T20:30:00+13:00",
      "is_audiomoth": true,
      "is_duplicate": false,
      "processing_time": "250ms"
    }
    ```

- `import_ml_selections` - Import ML-detected kiwi call selections
  - **Input**: Folder structure `Clips_{filter_name}_{date}/Species/CallType/*.wav+.png`
  - **Parses**: Selection filenames `{base}-{start}-{end}.wav`
  - **Validates**: Filter, species, call types, files, selection bounds
  - **Two-pass file matching**: Exact match then date_time pattern match
  - **Inserts**: selection → label (species) → label_subtype (call type)
  - **Transaction**: All-or-nothing import with comprehensive error reporting

- `bulk_file_import` - Batch import WAV files across multiple locations/clusters using CSV
  - **Input**: CSV file (see format below)
  - **Auto-creates clusters**: Creates clusters if they don't exist for location/date_range combinations
  - **Progress logging**: Writes detailed progress to log file for real-time monitoring (use `tail -f`)
  - **Synchronous execution**: Processes locations sequentially, fail-fast on errors
  - **Summary statistics**: Returns counts for clusters, files, duplicates, errors
  - **Duplicate handling**: Skips files with duplicate hashes across all clusters
  - **Use cases**: Bulk import across many locations, automated pipelines, large-scale migration

  **CSV Format:**
  - **Header required:** First row must contain column names
  - **Columns (in order):**
    1. `location_name` - Human-readable location name (string, can have spaces)
    2. `location_id` - 12-character location ID from database (must exist)
    3. `directory_path` - Absolute path to folder containing WAV files
    4. `date_range` - Cluster name (e.g., "20240101-20240107" or any string)
    5. `sample_rate` - Sample rate in Hz (integer, e.g., 8000, 48000, 250000)
    6. `file_count` - Expected file count (integer, informational only)
  
  **Important:**
  - `date_range` becomes the cluster name in the database
  - If cluster already exists for location+date_range, it will be reused
  - All `location_id` values must exist in the database (use `execute_sql` to query)
  - Paths should be absolute (relative paths may fail)
  
  **Example CSV:**
  ```csv
  location_name,location_id,directory_path,date_range,sample_rate,file_count
  "MOK RW 05","Ucfh8ng4DuEa","/media/david/Data/MOK RW 05","20240706-20240714","8000","432"
  "MOK RW 06","rDmmSPsJvNtD","/media/david/Data/MOK RW 06","20240706-20240714","8000","432"
  "mokas__01","EsDkvXosAp4C","/media/david/Data/mokas__01","20240520-20240528","8000","432"
  ```
  
  **Tool Call Example:**
  ```json
  {
    "name": "bulk_file_import",
    "arguments": {
      "dataset_id": "abc123xyz789",
      "csv_path": "/path/to/import.csv",
      "log_file_path": "/path/to/progress.log"
    }
  }
  ```
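
A rough sketch of how such a CSV could be read with Go's standard `encoding/csv` package; the struct and function names are illustrative, not the actual code in `tools/bulk_file_import.go`:

```go
package main

import (
	"encoding/csv"
	"fmt"
	"os"
	"strconv"
)

type csvRow struct {
	LocationName  string
	LocationID    string
	DirectoryPath string
	DateRange     string // becomes the cluster name
	SampleRate    int
	FileCount     int // informational only
}

func readImportCSV(path string) ([]csvRow, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	records, err := csv.NewReader(f).ReadAll() // handles quoted names like "MOK RW 05"
	if err != nil {
		return nil, err
	}
	var rows []csvRow
	for i, rec := range records {
		if i == 0 {
			continue // skip header row
		}
		if len(rec) != 6 {
			return nil, fmt.Errorf("row %d: expected 6 columns, got %d", i+1, len(rec))
		}
		rate, err := strconv.Atoi(rec[4])
		if err != nil {
			return nil, fmt.Errorf("row %d: bad sample_rate %q", i+1, rec[4])
		}
		count, _ := strconv.Atoi(rec[5])
		rows = append(rows, csvRow{rec[0], rec[1], rec[2], rec[3], rate, count})
	}
	return rows, nil
}

func main() {
	rows, err := readImportCSV("import.csv")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, r := range rows {
		fmt.Printf("%s → cluster %q (%d Hz)\n", r.LocationName, r.DateRange, r.SampleRate)
	}
}
```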

### Create/Update Tools
- `create_or_update_dataset` - Create (omit id) or update (provide id) a dataset
- `create_or_update_location` - Create or update a location with GPS coordinates and timezone
- `create_or_update_cluster` - Create or update a cluster within a location
- `create_or_update_pattern` - Create or update a cyclic recording pattern (record/sleep cycle)

### Security

**Database is read-only** (db/db.go:27):
```go
readOnlyPath := dbPath + "?access_mode=read_only"
```

**Validation layers:**
1. Regex validation: Must start with SELECT or WITH
2. Forbidden keywords: Blocks INSERT/UPDATE/DELETE/DROP/CREATE/ALTER
3. Row limiting: Prevents overwhelming responses

All write operations are blocked at both database and validation levels.
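
An illustrative sketch of these validation layers; the real checks live in `tools/sql.go` and may differ in detail:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	selectOrWith = regexp.MustCompile(`(?i)^\s*(SELECT|WITH)\b`)
	forbidden    = []string{"INSERT", "UPDATE", "DELETE", "DROP", "CREATE", "ALTER"}
)

// validateQuery enforces the read-only contract before the query ever reaches
// DuckDB (which is itself opened with access_mode=read_only).
func validateQuery(query string) error {
	if !selectOrWith.MatchString(query) {
		return fmt.Errorf("query must start with SELECT or WITH")
	}
	upper := strings.ToUpper(query)
	for _, kw := range forbidden {
		// Simplified substring check; a word-boundary match would be stricter.
		if strings.Contains(upper, kw) {
			return fmt.Errorf("forbidden keyword: %s", kw)
		}
	}
	return nil
}

// clampLimit applies the row-limit policy: default 1000, maximum 10000.
func clampLimit(limit int) int {
	if limit <= 0 {
		return 1000
	}
	if limit > 10000 {
		return 10000
	}
	return limit
}

func main() {
	fmt.Println(validateQuery("SELECT * FROM dataset"), clampLimit(0))
}
```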

## Resources

### Schema Resources
- `schema://full` - Complete 348-line database schema (SQL)
- `schema://table/{table_name}` - Individual table definitions

**Valid table names**: dataset, location, cluster, file, selection, label, species, species_group, genus, family_group, order_group, family, order, class, phylum, kingdom, kiwi_call, call, syllable, and more (see schema.sql)

## Prompts

Six SQL workflow templates that teach query patterns:

1. **query_active_datasets** - Dataset querying with SQL SELECT and GROUP BY
2. **explore_database_schema** - Interactive schema exploration (resource-based)
3. **explore_location_hierarchy** - Hierarchy navigation with SQL JOINs
4. **query_location_data** - Location analysis with SQL filtering and aggregates
5. **analyze_cluster_files** - File analysis with SQL aggregate functions
6. **system_status_check** - Comprehensive health check workflow

All prompts teach SQL patterns with complete examples.

## Example SQL Queries

### Basic Queries

**Get all active datasets:**
```sql
SELECT id, name, type, description, active
FROM dataset
WHERE active = true
ORDER BY type, name;
```

**Get locations for a dataset (parameterized):**
```json
{
  "query": "SELECT id, name, latitude, longitude FROM location WHERE dataset_id = ? AND active = true",
  "parameters": ["vgIr9JSH_lFj"]
}
```

### JOINs (Now Possible!)

**Dataset hierarchy with counts:**
```sql
SELECT
    d.name as dataset,
    COUNT(DISTINCT l.id) as location_count,
    COUNT(DISTINCT c.id) as cluster_count,
    COUNT(f.id) as file_count
FROM dataset d
LEFT JOIN location l ON d.id = l.dataset_id
LEFT JOIN cluster c ON l.id = c.location_id
LEFT JOIN file f ON c.id = f.cluster_id
WHERE d.active = true
GROUP BY d.name
ORDER BY d.name;
```

### Aggregates (Now Possible!)

**Cluster file statistics:**
```sql
SELECT
    COUNT(*) as total_files,
    SUM(duration) as total_duration,
    AVG(duration) as avg_duration,
    MIN(timestamp_local) as first_recording,
    MAX(timestamp_local) as last_recording,
    SUM(CASE WHEN maybe_solar_night THEN 1 ELSE 0 END) as night_files
FROM file
WHERE cluster_id = ? AND active = true;
```

### Complex Analysis (New Possibilities!)

**Geographic distribution:**
```sql
SELECT
    d.name as dataset,
    COUNT(DISTINCT l.id) as locations,
    AVG(l.latitude) as avg_latitude,
    AVG(l.longitude) as avg_longitude
FROM dataset d
LEFT JOIN location l ON d.id = l.dataset_id
WHERE d.active = true
GROUP BY d.name;
```

**Temporal coverage:**
```sql
SELECT
    DATE_TRUNC('day', timestamp_local) as day,
    COUNT(*) as recordings,
    SUM(duration) as total_seconds
FROM file
WHERE active = true
  AND timestamp_local >= '2024-01-01'
GROUP BY day
ORDER BY day
LIMIT 100;
```

## Database Information

### Database Path
Default: `./db/skraak.duckdb`

### Key Tables
- **dataset** - Project datasets (organise/test/train types)
- **location** - Recording locations with GPS coordinates (139 active locations)
- **cluster** - Grouped recordings at locations
- **file** - Individual audio files with metadata
- **label** - Annotations and classifications
- **species** - Taxonomy information

## Building and Running

### Build
```bash
go build -o skraak_mcp
```

### Run (stdio mode - waits for MCP protocol input)
```bash
./skraak_mcp ./db/skraak.duckdb
```

### Quick Tests
```bash
# Navigate to shell_scripts directory
cd shell_scripts

# Quick time check (no database needed)
./get_time.sh

# SQL tool tests (ALWAYS use test.duckdb and pipe to file!)
./test_sql.sh ../db/test.duckdb > test.txt 2>&1
rg '"result":' test.txt | wc -l  # Should show 8 responses (6 successful + 2 validations)
rg '"isError":true' test.txt | wc -l  # Should show 2 (security tests)

# Resources and prompts (use test.duckdb!)
./test_resources_prompts.sh ../db/test.duckdb > test_resources.txt 2>&1
cat test_resources.txt | jq '.'

# All prompts test (use test.duckdb!)
./test_all_prompts.sh ../db/test.duckdb > test_prompts.txt 2>&1
rg '"result":' test_prompts.txt | wc -l
```

## SQL Query Tips

### Using execute_sql Tool

**Basic query:**
```json
{
  "name": "execute_sql",
  "arguments": {
    "query": "SELECT * FROM dataset WHERE active = true"
  }
}
```

**Parameterized query (recommended for user input):**
```json
{
  "name": "execute_sql",
  "arguments": {
    "query": "SELECT * FROM location WHERE dataset_id = ?",
    "parameters": ["vgIr9JSH_lFj"]
  }
}
```

**With custom row limit:**
```json
{
  "name": "execute_sql",
  "arguments": {
    "query": "SELECT * FROM file WHERE active = true",
    "limit": 100
  }
}
```
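
Under the hood, these arguments map onto a standard `database/sql` call. A minimal sketch, assuming the `github.com/marcboeker/go-duckdb` driver and illustrative variable names (not the actual code in `tools/sql.go`):

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/marcboeker/go-duckdb" // registers the "duckdb" driver
)

func main() {
	db, err := sql.Open("duckdb", "db/test.duckdb?access_mode=read_only")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// `?` placeholders are bound positionally from the parameters array,
	// which keeps user-supplied IDs safe from SQL injection.
	query := "SELECT id, name FROM location WHERE dataset_id = ? AND active = true"
	limit := 100 // clamped server-side: default 1000, max 10000
	rows, err := db.Query(fmt.Sprintf("%s LIMIT %d", query, limit), "vgIr9JSH_lFj")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var id, name string
		if err := rows.Scan(&id, &name); err != nil {
			log.Fatal(err)
		}
		fmt.Println(id, name)
	}
}
```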

### SQL Best Practices

1. **Always use WHERE active = true** for main tables (dataset, location, cluster, file)
2. **Use parameterized queries** (? placeholders) for filtering by IDs
3. **Use LEFT JOIN** to include parent records even if children don't exist
4. **Use COUNT(DISTINCT)** when joining to avoid double-counting
5. **Use LIMIT** to restrict large result sets
6. **Use DATE_TRUNC** to group temporal data
7. **Use CASE WHEN** for conditional aggregates (e.g., count night vs day files)

## Common Issues and Solutions

### Query Results Too Large
**Problem**: Query returns too many rows
**Solution**: Use LIMIT clause (default 1000, max 10000)

### Server Exits Immediately
**Normal behavior** - Server runs in stdio mode, waiting for JSON-RPC input

### No Response from Tool Call
**Check**: Must initialize connection first with `initialize` method before calling tools

### Database Connection Failed
**Check**: Database path exists and is readable

### SQL Syntax Error
**Check**: Query syntax, table names (use schema resources), column names

## Go Unit Testing

### Test Coverage

The project includes comprehensive unit tests for all utility packages with **91.5% code coverage**.

**Test files:**
- `utils/astronomical_test.go` - Astronomical calculations (solar/civil night, moon phase)
- `utils/audiomoth_parser_test.go` - AudioMoth WAV comment parsing
- `utils/filename_parser_test.go` - Filename timestamp parsing with timezone handling
- `utils/selection_parser_test.go` - ML selection filename/folder parsing
- `utils/wav_metadata_test.go` - WAV file metadata extraction
- `utils/xxh64_test.go` - XXH64 hash computation

**Total: 170+ tests covering:**
- Date format detection (YYYYMMDD, YYMMDD, DDMMYY)
- Variance-based disambiguation
- Timezone offset calculation with fixed-offset strategy
- DST transition handling
- UTC conversion correctness
- AudioMoth metadata parsing (all gain levels, temperature, battery)
- WAV header parsing (duration, sample rate, channels, INFO chunks)
- XXH64 hash validation
- ML selection filename parsing (base-start-end format)
- ML folder name parsing (Clips_filter_date format)
- WAV/PNG pair validation
- Date/time pattern extraction for fuzzy file matching
- Edge cases (invalid dates, leap years, case sensitivity)

### Running Go Tests

```bash
# Run all tests
go test ./...

# Run specific package
go test ./utils/

# Run with verbose output
go test -v ./utils/

# Run specific test
go test -v ./utils/ -run TestParseFilenameTimestamps

# Run with coverage report
go test -cover ./utils/

# Generate coverage profile
go test -coverprofile=coverage.out ./utils/
go tool cover -html=coverage.out
```

### Test Organization

Tests follow Go conventions:
- Test files named `*_test.go`
- Test functions named `Test*`
- Use table-driven tests where appropriate
- Include edge cases and error conditions
- Match TypeScript test suite from original project

**Key differences from TypeScript tests:**
- Go separates filename parsing from timezone application (better design)
- Go validates dates strictly (TypeScript's Date constructor auto-corrects)
- Console logging tests omitted (not applicable to MCP servers)
- All essential functionality covered with equivalent or better tests
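
For reference, a minimal table-driven test in the style used across `utils/*_test.go`; `parseExample` and its signature are hypothetical stand-ins, not a real helper:

```go
package utils

import "testing"

func parseExample(s string) (int, error) { return len(s), nil } // stand-in for a real parser

func TestParseExample(t *testing.T) {
	tests := []struct {
		name    string
		input   string
		want    int
		wantErr bool
	}{
		{"simple", "abc", 3, false},
		{"empty", "", 0, false},
	}
	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			got, err := parseExample(tt.input)
			if (err != nil) != tt.wantErr {
				t.Fatalf("unexpected error state: %v", err)
			}
			if got != tt.want {
				t.Errorf("got %d, want %d", got, tt.want)
			}
		})
	}
}
```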

## Claude Desktop Configuration

Add to `~/.config/Claude/claude_desktop_config.json`:

```json
{
  "mcpServers": {
    "skraak_mcp": {
      "command": "/home/david/go/src/skraak_mcp/skraak_mcp",
      "args": ["/home/david/go/src/skraak_mcp/db/skraak.duckdb"]
    }
  }
}
```

Remember to restart Claude Desktop after configuration changes.

## Recent Changes

### Latest Update: Bulk File Import Cluster Assignment Bug Fix (2026-02-10)

**Critical Bug Fix: Files now correctly distributed across multiple clusters for same location**

**Problem:** When the same location appeared multiple times in the CSV with different date ranges, all files ended up in the **last cluster created** instead of being distributed across their respective clusters.

**Root Cause:** The `clusterIDMap` used only `LocationID` as the key, causing each new cluster for the same location to overwrite the previous one in the map.

**Example of Bug:**
```csv
A12,loc123,/path/2019,2019,8000,864
A12,loc123,/path/2020,2020,8000,180
A12,loc123,/path/2022,2022,8000,180
A12,loc123,/path/2024,2024,8000,549
```
- **Before fix:** 4 clusters created, ALL 1773 files go into 2024 cluster
- **After fix:** 4 clusters created, files distributed correctly (864 in 2019, 180 in 2020, etc.)

**Solution:** Changed map key from `LocationID` to composite key `LocationID|DateRange`.
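
A self-contained illustration of the bug and the fix (the `locationRow` type and literal IDs are made up for the example):

```go
package main

import "fmt"

type locationRow struct {
	LocationID string
	DateRange  string
}

func main() {
	rows := []locationRow{
		{"loc123", "2019"}, {"loc123", "2020"}, {"loc123", "2024"},
	}

	// Before the fix: keyed by LocationID only, so each new cluster for the
	// same location overwrote the previous entry and every file landed in the
	// last cluster created.
	buggy := make(map[string]string)
	for _, r := range rows {
		buggy[r.LocationID] = "cluster-" + r.DateRange
	}
	fmt.Println(len(buggy), "entry ->", buggy["loc123"]) // 1 entry -> cluster-2024

	// After the fix: the composite LocationID|DateRange key keeps one entry
	// per cluster, so files are routed to the cluster for their own row.
	fixed := make(map[string]string)
	for _, r := range rows {
		fixed[r.LocationID+"|"+r.DateRange] = "cluster-" + r.DateRange
	}
	fmt.Println(len(fixed), "entries") // 3 entries
}
```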

**Modified:**
- `tools/bulk_file_import.go` (lines 125, 171-172, 183-184)
  - Line 125: Updated map comment to reflect composite key
  - Lines 171-172: Store with `compositeKey := loc.LocationID + "|" + loc.DateRange`
  - Lines 183-184: Retrieve with the same composite key

**Impact:**
- **Data integrity restored:** Files now go to correct clusters
- **Multiple date ranges per location:** Now works correctly
- **No breaking changes:** Simple 3-line fix
- **Backwards compatible:** Single location CSV rows work identically

**Verification:**
```sql
SELECT
    l.name as location_name,
    c.name as cluster_name,
    COUNT(f.id) as file_count
FROM cluster c
JOIN location l ON c.location_id = l.id
LEFT JOIN file f ON f.cluster_id = c.id
WHERE l.name = 'A12'
GROUP BY c.id, l.name, c.name
ORDER BY c.name;
```

**Note:** Data previously imported with the buggy code will need to be re-imported to fix cluster assignments.

---

### Previous Update: File Modification Time Fallback (2026-02-07)

**Enhancement: Added file modification time as third timestamp fallback**

**Problem:** Small clusters (1-2 files) failed variance-based filename disambiguation because the algorithm needs multiple samples to determine date format (YYYYMMDD vs YYMMDD vs DDMMYY).

**Solution:** Added file modification time as third fallback in timestamp resolution chain.

**Timestamp Resolution Order:**
```
1. AudioMoth comment → timestamp
2. Filename parsing → timestamp  
3. File modification time → timestamp (NEW!)
4. FAIL (skip file with error)
```

**Modified:**
- `utils/cluster_import.go` - Added FileModTime fallback in `batchProcessFiles()`
  - Silent fallback (no warning logged)
  - Assumes FileModTime is in location timezone
  - Reduces import failures in small clusters

**Benefits:**
- **Fewer failures:** Small clusters (1-2 files) no longer fail when filename parsing can't disambiguate
- **No performance impact:** FileModTime already extracted in `ParseWAVHeader()`
- **Backwards compatible:** Only helps files that would have failed
- **Simple:** 10 lines of code, defensive checks, no complexity

**Use Case:** User has 1-2 files with unparseable filenames (e.g., `recording001.wav`) → Previously failed, now uses FileModTime.
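
A sketch of the three-step resolution chain described above; the types and helper names are illustrative, not the actual ones in `utils/cluster_import.go`:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

type fileInfo struct {
	AudioMothTime *time.Time // parsed from the AudioMoth WAV comment, if present
	FilenameTime  *time.Time // parsed from the filename, if disambiguation succeeded
	FileModTime   time.Time  // filesystem modification time (always available)
}

// resolveTimestamp picks the first available source: AudioMoth comment, then
// filename, then file modification time (assumed to already be in the
// location's timezone). An error here means the file is skipped.
func resolveTimestamp(f fileInfo) (time.Time, error) {
	switch {
	case f.AudioMothTime != nil:
		return *f.AudioMothTime, nil
	case f.FilenameTime != nil:
		return *f.FilenameTime, nil
	case !f.FileModTime.IsZero():
		return f.FileModTime, nil
	}
	return time.Time{}, errors.New("no timestamp source available")
}

func main() {
	ts, err := resolveTimestamp(fileInfo{FileModTime: time.Now()})
	fmt.Println(ts, err) // falls through to the modification time
}
```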

**See Also:**
- `TIMESTAMP_FALLBACK_PLAN.md` - Complete implementation plan

---

### Previous Update: Cluster Import Logic Extraction (2026-02-07)

**Major refactoring: Extracted shared cluster import logic into utils module**

**Key Insight:** A cluster is the atomic unit of import (one SD card / one recording session / one folder).

**Created:**
- `utils/cluster_import.go` (553 lines) - Single source of truth for cluster imports
  - `ImportCluster()` - Main entry point used by both import_files.go and bulk_file_import.go
  - `scanClusterFiles()` - Recursive WAV file scanning
  - `batchProcessFiles()` - Batch processing with variance-based filename timestamp parsing
  - `insertClusterFiles()` - Transactional database insertion
  - Moved `FileImportError` type from tools/ to utils/

**Modified:**
- `tools/import_files.go` - **75% code reduction** (650 lines → 161 lines)
  - Now just calls `utils.ImportCluster()` for all the heavy lifting
  - Removed ~500 lines of duplicated logic
- `tools/bulk_file_import.go` - **Bug fixes + simplification**
  - **🐛 CRITICAL BUG FIXED:** Now inserts into `file_dataset` table (was missing!)
  - **🐛 CRITICAL BUG FIXED:** Now inserts into `moth_metadata` table (was missing!)
  - Now uses shared `utils.ImportCluster()` logic
  - Files are no longer orphaned from datasets
- `tools/import_file.go` - Added helper wrappers for compatibility

**Benefits:**
- **Bug Fixed:** 68,043 orphaned files found in test database (confirms bug was real)
- **Single source of truth:** All cluster import logic in one place
- **Code reduction:** ~500 lines of duplicated code eliminated
- **Consistency:** Both single-cluster and multi-cluster imports use identical logic
- **Maintainability:** Changes to import logic made in one place
- **Performance:** No regression, same batch processing as before

**Architecture:**
```
Before:
  tools/import_files.go (650 lines) - Custom logic
  tools/bulk_file_import.go (460 lines) - Different logic (BUGGY)

After:
  utils/cluster_import.go (553 lines) - Shared logic
  tools/import_files.go (161 lines) - Calls utils.ImportCluster()
  tools/bulk_file_import.go (393 lines) - Calls utils.ImportCluster()
```

**See Also:**
- `plan.md` - Complete refactoring plan with implementation checklist
- `REFACTORING_SUMMARY.md` - Detailed summary of changes
- `VERIFICATION_RESULTS.md` - Test results and database analysis

---

### Previous Update: Generic SQL Tool + Codebase Rationalization (2026-01-26)

**Major architectural change: Replaced 6 specialized tools with generic SQL approach**

**Deleted:**
- `tools/dataset.go` - query_datasets tool
- `tools/location.go` - query_locations, query_locations_by_dataset tools
- `tools/cluster.go` - query_clusters, query_clusters_by_location tools
- `tools/file.go` - query_files_by_cluster tool
- `shell_scripts/test_new_tools.sh` - Obsolete test script
- `shell_scripts/test_mcp.sh` - Obsolete test script

**Added:**
- `tools/sql.go` - Generic execute_sql tool (~200 lines)
- `shell_scripts/test_sql.sh` - Comprehensive SQL test suite

**Modified:**
- `main.go` - Removed 6 tool registrations, kept only get_current_time and execute_sql
- `prompts/examples.go` - Completely rewritten to teach SQL patterns instead of tool calls
- All 6 prompts now include SQL examples with SELECT, JOIN, GROUP BY, aggregates

**Benefits:**
- Full SQL expressiveness (JOINs, aggregates, CTEs, subqueries) - **previously impossible**
- Infinite query possibilities vs 6 fixed queries
- More aligned with MCP philosophy (context over APIs)
- LLMs can answer any question given the schema
- Smaller codebase (2 tools instead of 8)
- More maintainable (no new tool for each query pattern)

**Security:**
- Database already read-only (verified in db/db.go)
- Validation layers block write operations
- Parameterized queries prevent SQL injection
- Row limits prevent overwhelming responses

**Migration Notes:**
- Old tool calls must be replaced with SQL queries
- All old functionality is still available via SQL
- Prompts provide SQL examples for common patterns
- Schema resources provide full context for query construction

### Previous Update: Shell Scripts Organization (2026-01-26)
- Reorganized all shell scripts into `shell_scripts/` directory
- Keeps project root clean and organized
- All scripts updated with correct relative paths

### Previous Update: Comprehensive Go Unit Testing (2026-01-28)

**Added comprehensive unit test suite for utility packages**

**Added:**
- `utils/astronomical_test.go` - 11 test cases for astronomical calculations
- `utils/audiomoth_parser_test.go` - 36 test cases for AudioMoth parsing
- `utils/filename_parser_test.go` - 60 test cases for filename/timezone parsing
- `utils/wav_metadata_test.go` - 22 test cases for WAV metadata extraction
- `utils/xxh64_test.go` - 6 test cases for hash computation

**Test Coverage:**
- **Total: 136 tests**
- **Coverage: 91.5%** of statements
- All tests ported from TypeScript test suite
- Additional Go-specific tests for date validation

**Key Test Areas:**
- Filename parsing: YYYYMMDD, YYMMDD, DDMMYY formats with variance-based disambiguation
- Timezone handling: Fixed-offset strategy, DST transitions (Auckland, US timezones)
- UTC conversion: Mathematical correctness validation
- AudioMoth: Comment parsing, all gain levels, timezone formats
- WAV metadata: Duration, sample rate, INFO chunks
- Astronomical: Solar/civil night, moon phase calculations
- ML selection parsing: Filename format, folder structure, WAV/PNG pairing
- Edge cases: Invalid dates, leap years, case sensitivity

### Previous Update: ML Selection Import Tool (2026-01-29)

**New Feature: Import ML-detected kiwi call selections from folder structure**

**Added:**
- `utils/selection_parser.go` - Selection filename and folder parsing utilities
- `utils/selection_parser_test.go` - Comprehensive tests (34 test cases)
- `tools/import_ml_selections.go` - MCP tool implementation (~1050 lines)
- `shell_scripts/test_import_selections.sh` - Integration test script

**Features:**
- **Folder structure support**: `Clips_{filter_name}_{date}/Species/CallType/*.wav+.png`
- **Filename parsing**: Extracts `{base}-{start}-{end}.wav` format
- **Two-pass file matching**:
  1. Exact match by filename
  2. Fuzzy match by date_time pattern (handles prefix variations)
- **Comprehensive validation**:
  - Filter exists in database
  - Species linked to dataset
  - Call types exist for species
  - Files exist in cluster
  - Selection bounds within file duration
- **Transactional import**: All-or-nothing with error collection
- **Database relations**: selection → label (species) → label_subtype (call type)

**Usage Example:**
```bash
# Folder structure:
# Clips_opensoundscape-kiwi-1.0_2025-11-14/
#   └── Brown Kiwi/
#       ├── Male - Solo/
#       │   ├── A05-20250517_214501-102-133.wav
#       │   └── A05-20250517_214501-102-133.png
#       └── Female - Solo/
#           └── ...

# MCP tool call:
{
  "name": "import_ml_selections",
  "arguments": {
    "folder_path": "/path/to/Clips_opensoundscape-kiwi-1.0_2025-11-14",
    "dataset_id": "abc123xyz789",
    "cluster_id": "def456uvw012"
  }
}
```
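
A hypothetical sketch of parsing the `{base}-{start}-{end}.wav` convention; the real parser lives in `utils/selection_parser.go` and handles more cases (decimal times, fuzzy matching, PNG pairing):

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// The base name may itself contain dashes (e.g. A05-20250517_214501), so the
// greedy first group leaves only the last two numeric fields as start/end.
var selectionRe = regexp.MustCompile(`^(.+)-(\d+(?:\.\d+)?)-(\d+(?:\.\d+)?)\.wav$`)

func parseSelection(name string) (base string, start, end float64, err error) {
	m := selectionRe.FindStringSubmatch(name)
	if m == nil {
		return "", 0, 0, fmt.Errorf("filename %q does not match base-start-end.wav", name)
	}
	start, _ = strconv.ParseFloat(m[2], 64)
	end, _ = strconv.ParseFloat(m[3], 64)
	return m[1], start, end, nil
}

func main() {
	base, start, end, err := parseSelection("A05-20250517_214501-102-133.wav")
	fmt.Println(base, start, end, err) // A05-20250517_214501 102 133 <nil>
}
```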

**Validation Features:**
- Batch validation of all entities before any database writes
- Comprehensive error reporting (filename, species, call type, stage)
- Fuzzy file matching handles filename prefix variations
- Strict selection bounds checking (end_time ≤ file.duration)
- Ambiguous match detection (multiple files with same date_time pattern)

**Test Coverage:**
- 34 unit tests for selection parsing utilities
- Tests for various filename formats (with/without dashes, decimal times)
- Tests for folder name parsing (filter + date extraction)
- Tests for WAV/PNG pair validation
- Tests for date_time pattern extraction (8-digit and 6-digit formats)

**Tool Count Update**: Now **7 total tools** (read: 2, write: 4, import: 2)
*Note: Tool count increased to 10 in later updates (import: 4)*

### Previous Update: Single File Import Tool (2026-02-02)

**New Feature: Import individual WAV files with `import_file` tool**

**Added:**
- `tools/import_file.go` - Single file import implementation (~300 lines)
- `shell_scripts/test_import_file.sh` - Integration test script

**Features:**
- **Single file import**: Import one WAV file at a time with detailed feedback
- **Same processing pipeline**: Reuses all utilities from batch import (AudioMoth parsing, timestamp extraction, hash computation, astronomical calculations)
- **Shared helper functions**: Reuses `validateImportInput()`, `getLocationData()`, `ensureClusterPath()` from import_files.go
- **Detailed output**: Returns file_id, hash, duration, sample_rate, timestamps, processing time
- **Duplicate detection**: Checks hash before insertion, returns `is_duplicate=true` if exists
- **Fail-fast errors**: Single file import is atomic - succeeds completely or fails with clear error message

**Input:**
```json
{
  "file_path": "/absolute/path/to/file.wav",
  "dataset_id": "12-char-id",
  "location_id": "12-char-id",
  "cluster_id": "12-char-id"
}
```

**Output:**
```json
{
  "file_id": "21-char-nanoid",
  "file_name": "filename.wav",
  "hash": "16-char-xxh64-hex",
  "duration_seconds": 60.0,
  "sample_rate": 250000,
  "timestamp_local": "2024-01-15T20:30:00+13:00",
  "is_audiomoth": true,
  "is_duplicate": false,
  "processing_time": "250ms"
}
```

**Use Cases:**
- Import files one at a time with detailed feedback per file
- Programmatic import where you already know the file path
- Import files from different locations without folder scanning
- Get immediate feedback on duplicate detection
- Alternative to batch import for small numbers of files

**Tool Count Update**: Now **8 total tools** (read: 2, write: 4, import: 3)
*Note: Tool count increased to 10 in later update (import: 4)*

### Previous Update: Bulk File Import Tool (2026-02-06)

**New Feature: CSV-based bulk import across multiple locations and clusters**

**Added:**
- `tools/bulk_file_import.go` - CSV-based bulk import implementation (~500 lines)

**Features:**
- **CSV-driven import**: Single CSV file specifies multiple locations, directories, and clusters
- **Auto-cluster creation**: Automatically creates clusters if they don't exist for location/date_range
- **Progress logging**: Real-time progress logging to file (monitor with `tail -f log_file.log`)
- **Synchronous/fail-fast**: Processes sequentially with immediate error reporting
- **Summary statistics**: Returns detailed counts for locations, clusters, files, duplicates, errors
- **Shared utilities**: Reuses all WAV processing utilities (AudioMoth, timestamps, hash, astronomical)

**CSV Format:**
```csv
location_name,location_id,directory_path,date_range,sample_rate,file_count
Site A,loc123456789,/path/to/recordings,2024-01,48000,150
Site B,loc987654321,/path/to/recordings,2024-02,250000,200
```

**Input:**
```json
{
  "dataset_id": "12-char-id",
  "csv_path": "/absolute/path/to/import.csv",
  "log_file_path": "/absolute/path/to/progress.log"
}
```

**Output:**
```json
{
  "total_locations": 10,
  "clusters_created": 5,
  "clusters_existing": 5,
  "total_files_scanned": 1500,
  "files_imported": 1200,
  "files_duplicate": 250,
  "files_error": 50,
  "processing_time": "5m30s",
  "errors": []
}
```

**Use Cases:**
- Bulk import across many locations in one operation
- Automated import pipelines with CSV generation
- Large-scale data migration from existing systems
- Batch processing with progress monitoring via log file

**Comparison with other import tools:**
- `import_audio_files`: Single folder → single cluster
- `import_file`: Single file → single cluster
- `import_ml_selections`: ML detection folder structure → selections
- `bulk_file_import`: CSV with multiple folders → multiple clusters (auto-creates)

**Tool Count Update**: Now **10 total tools** (read: 2, write: 4, import: 4)

### Previous Update: Test Script Consolidation (2026-02-06)

**Rationalized and consolidated shell test scripts for better organization**

**Removed redundant scripts:**
- `test_import_simple.sh` - Only tested registration (redundant)
- `test_import_tool.sh` - Incomplete, just schema validation
- `test_write_simple.sh` - Incomplete happy-path test
- `test_write_tools.sh` - Replaced by comprehensive test_tools.sh
- `test_write_e2e.sh` - Required manual ID replacement (not automated)
- `test_update_tools.sh` - Replaced by test_tools.sh

**Added comprehensive test scripts:**
- `test_tools.sh` - All 4 create_or_update tools (create + update modes) with validation
- `test_bulk_import.sh` - Tests bulk_file_import tool with CSV parsing

**Updated documentation:**
- `shell_scripts/TESTING.md` - Complete rewrite with current tool set
- Removed references to deleted tools (query_datasets, etc.)
- Added examples for all 14 current tools
- Added SQL query examples (JOINs, aggregates, temporal analysis)
- Added troubleshooting section and best practices

**Current test suite (8 scripts):**
1. `get_time.sh` - Time tool (no database)
2. `test_sql.sh` - SQL query tool (comprehensive)
3. `test_tools.sh` - All create_or_update tools (4 tools, create + update modes)
4. `test_import_file.sh` - Single file import
5. `test_import_selections.sh` - ML selection import
6. `test_bulk_import.sh` - Bulk CSV import
7. `test_resources_prompts.sh` - Resources/prompts
8. `test_all_prompts.sh` - All 6 prompts

**Benefits:**
- Cleaner shell_scripts directory (8 scripts vs 14)
- Better organization by functionality
- No redundant/incomplete tests
- Comprehensive coverage of all 10 tools
- Up-to-date documentation matching current codebase
- All tests default to test.duckdb for safety

### Previous Update: Tool Consolidation - 8 write/update tools → 4 create_or_update tools (2026-02-06)

**Consolidated 4 create_* + 4 update_* tools into 4 create_or_update_* tools**

**Deleted (8 files):**
- `tools/write_dataset.go`, `tools/write_location.go`, `tools/write_cluster.go`, `tools/write_pattern.go`
- `tools/update_dataset.go`, `tools/update_location.go`, `tools/update_cluster.go`, `tools/update_pattern.go`
- `tools/write_pattern_test.go`

**Added (4 files + 1 test):**
- `tools/dataset.go` - `create_or_update_dataset` (create when no id, update when id provided)
- `tools/location.go` - `create_or_update_location`
- `tools/cluster.go` - `create_or_update_cluster`
- `tools/pattern.go` - `create_or_update_pattern`
- `tools/pattern_test.go` - Updated tests for consolidated pattern tool

**Modified:**
- `main.go` - 8 tool registrations → 4
- `tools/integration_test.go` - Updated to use new ClusterInput/CreateOrUpdateCluster types
- `shell_scripts/test_tools.sh` - Updated to test 4 tools (both create and update modes)
- `shell_scripts/test_bulk_import.sh` - Updated tool names
- `shell_scripts/TESTING.md` - Updated documentation
- `CLAUDE.md` - Updated tool counts, directory structure, documentation

**Design:**
- Omit `id` field → CREATE mode (generates nanoid, inserts, returns entity)
- Provide `id` field → UPDATE mode (verifies exists, builds dynamic UPDATE, returns entity)
- Shared validation logic per entity (e.g., coordinate bounds, name length)
- Both modes now return the full entity (update previously only returned success boolean)
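
A sketch of this create-or-update dispatch pattern; the input struct and helper names are illustrative, not the actual code in `tools/dataset.go`:

```go
package main

import "fmt"

type DatasetInput struct {
	ID   string // empty → create, non-empty → update
	Name string
	Type string
}

type Dataset struct {
	ID, Name, Type string
}

func createOrUpdateDataset(in DatasetInput) (Dataset, error) {
	if err := validate(in); err != nil { // shared validation for both modes
		return Dataset{}, err
	}
	if in.ID == "" {
		id := newNanoID() // hypothetical 12-char ID generator
		// INSERT INTO dataset ... would happen here
		return Dataset{ID: id, Name: in.Name, Type: in.Type}, nil
	}
	// Verify the row exists, build a dynamic UPDATE, then return the entity.
	return Dataset{ID: in.ID, Name: in.Name, Type: in.Type}, nil
}

func validate(in DatasetInput) error {
	if in.Name == "" {
		return fmt.Errorf("name is required")
	}
	return nil
}

func newNanoID() string { return "abc123xyz789" } // stand-in for the real generator

func main() {
	created, _ := createOrUpdateDataset(DatasetInput{Name: "Example Dataset", Type: "organise"})
	updated, _ := createOrUpdateDataset(DatasetInput{ID: created.ID, Name: "Example Dataset v2", Type: "organise"})
	fmt.Println(created.ID == updated.ID) // true: update keeps the provided id
}
```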

**Benefits:**
- Tool count reduced from 14 → 10 (fewer tools for LLM to reason about)
- File count reduced from 8 → 4 (fewer files to maintain)
- ~31% less code (~320 lines removed)
- Shared validation logic eliminates duplication
- Consistent return types (both modes return the entity)

---

**Last Updated**: 2026-02-10 NZDT
**Status**: Bulk import cluster assignment bug fixed 🐛✅
**Current Tools**: 10 (read: 2, write: 4, import: 4)
**Test Scripts**: 8 comprehensive shell scripts + verify_database_state.sh
**Test Coverage**: 170+ Go unit tests (91.5% coverage)
**Import Logic**: Centralized in utils/cluster_import.go (553 lines)
**Timestamp Fallback**: AudioMoth → Filename → FileModTime (reduces failures in small clusters)
**Code Quality**: ~500 lines of duplication eliminated (75% reduction in import_files.go)
**Current Database**: 139 locations, 8 active datasets, 1.19M files in test.duckdb
**Test Databases**: skraak.duckdb (production) ⚠️, test.duckdb (testing) ✅
**Known Issue**: test.duckdb has 68K orphaned files from old buggy import (historical data)