Introduction
Character counting for Japanese text presents unique challenges that differ significantly from Latin-based languages. Kantan Tools offers a cleanly designed, practical collection of web utilities, and its character counter in particular caught my attention. This article explores the technical implementation behind Kantan Tools' 文字数 (mojisuu) character counter, examining the algorithms and engineering approaches required for accurate Japanese text analysis.
The Japanese Text Complexity Challenge
Understanding Japanese Writing Systems
The modern Japanese writing system uses a combination of logographic kanji, which are adopted Chinese characters, and syllabic kana. Kana itself consists of a pair of syllabaries: hiragana, used primarily for native or naturalized Japanese words and grammatical elements; and katakana, used primarily for foreign words and names, loanwords, onomatopoeia, scientific names, and sometimes for emphasis.
The complexity arises from several factors:
- Multiple Script Systems: In modern Japanese, the hiragana and katakana syllabaries each contain 46 basic characters, or 71 including diacritics
- Mixed Text Composition: Almost all written Japanese sentences contain a mixture of kanji and kana
- Character Variants: Extended character sets include half-width katakana, punctuation, and numerical representations
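These overlapping scripts are easy to see in code. As a quick illustration (not the Kantan Tools implementation), modern JavaScript engines can classify each script with ES2018 Unicode property escapes, a compact cross-check for the hand-built range tables used later in this article:

```typescript
// Per-script tally using ES2018 Unicode property escapes (requires the /u flag).
// A compact alternative to hand-maintained code point range tables.
const SCRIPT_PATTERNS: Record<string, RegExp> = {
  hiragana: /\p{Script=Hiragana}/u,
  katakana: /\p{Script=Katakana}/u,
  kanji: /\p{Script=Han}/u,
};

function tallyScripts(text: string): Record<string, number> {
  const counts: Record<string, number> = { hiragana: 0, katakana: 0, kanji: 0 };
  // for...of iterates by code point, so surrogate pairs stay intact
  for (const char of text) {
    for (const name of Object.keys(SCRIPT_PATTERNS)) {
      if (SCRIPT_PATTERNS[name].test(char)) counts[name]++;
    }
  }
  return counts;
}

// tallyScripts('カタカナと漢字とひらがな') → { hiragana: 6, katakana: 4, kanji: 2 }
```

The two approaches can disagree at the edges: the prolonged sound mark ー (U+30FC), for example, sits in the Katakana block but carries the Common script property, which is one reason explicit range tables remain popular for counting.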
Unicode Ranges for Japanese Characters
The technical foundation for Japanese character counting relies on Unicode character ranges:
// Core Unicode ranges for Japanese text processing
const JAPANESE_UNICODE_RANGES = {
hiragana: {
basic: [0x3040, 0x309F], // U+3040-U+309F
extended: [0x1F200, 0x1F200], // U+1F200 SQUARE HIRAGANA HOKA (Enclosed Ideographic Supplement)
smallKana: [0x1B132, 0x1B152] // Small Kana Extension
},
katakana: {
basic: [0x30A0, 0x30FF], // U+30A0-U+30FF
halfWidth: [0xFF65, 0xFF9F], // Half-width katakana
supplement: [0x31F0, 0x31FF] // Katakana Phonetic Extensions
},
kanji: {
cjkUnified: [0x4E00, 0x9FFF], // CJK Unified Ideographs
extension: [0x3400, 0x4DBF], // CJK Extension A
compatibility: [0xF900, 0xFAFF], // CJK Compatibility Ideographs
extensionB: [0x20000, 0x2A6DF] // CJK Extension B (rare)
}
};
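One detail worth calling out: everything in the Extension B block lies outside the Basic Multilingual Plane, so JavaScript stores each of those characters as a surrogate pair of two UTF-16 code units. A small standalone check, using 𠮷 (U+20BB7, a variant of 吉 that appears in some proper names):

```typescript
// '𠮷' (U+20BB7) sits in CJK Extension B, so JavaScript stores it as a
// surrogate pair. Only the full code point lands inside the range table.
const rareKanji = '𠮷';

const codeUnit = rareKanji.charCodeAt(0);    // 0xD842 — just the high surrogate
const codePoint = rareKanji.codePointAt(0)!; // 0x20BB7 — the full code point

const inExtensionB = codePoint >= 0x20000 && codePoint <= 0x2A6DF; // true
const unitInRange = codeUnit >= 0x20000 && codeUnit <= 0x2A6DF;    // false
```

This is why the classification engine below reads code points via codePointAt rather than UTF-16 code units via charCodeAt.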
Core Algorithm Implementation
Character Classification Engine
The goal was to create a Japanese-optimized text counter that goes beyond basic character counting:
- Basic Stats: character count, word count, line count
- Japanese-Specific: separate counts for ひらがな (Hiragana), カタカナ (Katakana), and 漢字 (Kanji)
- Practical Metrics: manuscript paper calculation, byte count in various encodings
interface CharacterAnalysis {
totalCharacters: number;
hiragana: number;
katakana: number;
kanji: number;
punctuation: number;
numbers: number;
latin: number;
whitespace: number;
lineCount: number;
byteCount: {
utf8: number;
utf16: number;
sjis: number;
};
manuscriptPages: number;
}
class JapaneseCharacterCounter {
private unicodeRanges: typeof JAPANESE_UNICODE_RANGES;
constructor() {
this.unicodeRanges = JAPANESE_UNICODE_RANGES;
}
/**
* Analyzes Japanese text and returns comprehensive character statistics
* Uses Unicode code point analysis for accurate classification
*/
analyzeText(text: string): CharacterAnalysis {
const analysis: CharacterAnalysis = {
totalCharacters: 0,
hiragana: 0,
katakana: 0,
kanji: 0,
punctuation: 0,
numbers: 0,
latin: 0,
whitespace: 0,
lineCount: 1,
byteCount: { utf8: 0, utf16: 0, sjis: 0 },
manuscriptPages: 0
};
// Use Array.from to handle surrogate pairs correctly
const characters = Array.from(text);
for (const char of characters) {
const codePoint = char.codePointAt(0);
if (!codePoint) continue;
analysis.totalCharacters++;
// Classify character by Unicode ranges
if (this.isHiragana(codePoint)) {
analysis.hiragana++;
} else if (this.isKatakana(codePoint)) {
analysis.katakana++;
} else if (this.isKanji(codePoint)) {
analysis.kanji++;
} else if (this.isWhitespace(char)) {
analysis.whitespace++;
} else if (this.isNumber(char)) {
analysis.numbers++;
} else if (this.isLatin(codePoint)) {
analysis.latin++;
} else if (this.isPunctuation(codePoint)) {
analysis.punctuation++;
}
// Count line breaks
if (char === '\n') {
analysis.lineCount++;
}
}
// Calculate byte counts for different encodings
analysis.byteCount = this.calculateByteCount(text);
// Calculate manuscript paper equivalents (400 chars per page)
analysis.manuscriptPages = Math.ceil(analysis.totalCharacters / 400);
return analysis;
}
/**
* Determines if character is Hiragana
* Includes basic range and extended Unicode blocks
*/
private isHiragana(codePoint: number): boolean {
const ranges = this.unicodeRanges.hiragana;
return (
this.inRange(codePoint, ranges.basic) ||
this.inRange(codePoint, ranges.extended) ||
this.inRange(codePoint, ranges.smallKana)
);
}
/**
* Determines if character is Katakana
* Handles both full-width and half-width katakana
*/
private isKatakana(codePoint: number): boolean {
const ranges = this.unicodeRanges.katakana;
return (
this.inRange(codePoint, ranges.basic) ||
this.inRange(codePoint, ranges.halfWidth) ||
this.inRange(codePoint, ranges.supplement)
);
}
/**
* Determines if character is Kanji (Chinese characters)
* Includes multiple CJK Unicode blocks
*/
private isKanji(codePoint: number): boolean {
const ranges = this.unicodeRanges.kanji;
return (
this.inRange(codePoint, ranges.cjkUnified) ||
this.inRange(codePoint, ranges.extension) ||
this.inRange(codePoint, ranges.compatibility) ||
this.inRange(codePoint, ranges.extensionB)
);
}
private inRange(codePoint: number, range: number[]): boolean {
return codePoint >= range[0] && codePoint <= range[1];
}
private isWhitespace(char: string): boolean {
return /\s/.test(char);
}
private isNumber(char: string): boolean {
// Matches both ASCII (0-9) and full-width (０-９) digits
return /[0-9０-９]/.test(char);
}
private isLatin(codePoint: number): boolean {
return (
(codePoint >= 0x0041 && codePoint <= 0x005A) || // A-Z
(codePoint >= 0x0061 && codePoint <= 0x007A) || // a-z
(codePoint >= 0xFF21 && codePoint <= 0xFF3A) || // Full-width A-Z
(codePoint >= 0xFF41 && codePoint <= 0xFF5A) // Full-width a-z
);
}
private isPunctuation(codePoint: number): boolean {
// Japanese punctuation and symbols
return (
(codePoint >= 0x3000 && codePoint <= 0x303F) || // CJK Symbols and Punctuation
(codePoint >= 0xFF00 && codePoint <= 0xFF0F) || // Full-width ASCII variants
(codePoint >= 0xFF1A && codePoint <= 0xFF20) || // Full-width punctuation
(codePoint >= 0xFF3B && codePoint <= 0xFF40) || // More full-width punctuation
(codePoint >= 0xFF5B && codePoint <= 0xFF65) // Additional symbols
);
}
/**
* Calculates text size in different character encodings
* Critical for document formatting and system compatibility
*/
private calculateByteCount(text: string): CharacterAnalysis['byteCount'] {
const encoder = new TextEncoder();
return {
utf8: encoder.encode(text).length,
utf16: text.length * 2, // 2 bytes per UTF-16 code unit (surrogate pairs count as 4)
sjis: this.estimateSJISByteCount(text)
};
}
private estimateSJISByteCount(text: string): number {
let byteCount = 0;
for (const char of Array.from(text)) {
const codePoint = char.codePointAt(0);
if (!codePoint) continue;
if (codePoint <= 0x7F) {
byteCount += 1; // ASCII characters
} else if (this.isHalfWidthKatakana(codePoint)) {
byteCount += 1; // Half-width katakana
} else {
byteCount += 2; // Most Japanese characters
}
}
return byteCount;
}
private isHalfWidthKatakana(codePoint: number): boolean {
return codePoint >= 0xFF65 && codePoint <= 0xFF9F;
}
}
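Since real Shift-JIS encoding needs full conversion tables, the byte estimate above uses simple width rules. This standalone sketch repeats those rules so they can be sanity-checked in isolation:

```typescript
// Standalone version of the Shift-JIS estimate: 1 byte for ASCII and
// half-width katakana, 2 bytes for other (double-byte) characters.
function estimateSJISBytes(text: string): number {
  let bytes = 0;
  for (const char of text) {
    const cp = char.codePointAt(0)!;
    if (cp <= 0x7F) bytes += 1;                        // ASCII
    else if (cp >= 0xFF65 && cp <= 0xFF9F) bytes += 1; // half-width katakana
    else bytes += 2;                                   // approximate: double-byte
  }
  return bytes;
}

// estimateSJISBytes('abc')        → 3  (ASCII, 1 byte each)
// estimateSJISBytes('ｱｲｳ')        → 3  (half-width katakana, 1 byte each)
// estimateSJISBytes('こんにちは') → 10 (5 double-byte characters)
```

It remains an estimate: characters with no Shift-JIS mapping at all (emoji, rare kanji) are still counted at 2 bytes.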
Performance Optimization Strategies
Real-time Processing Architecture
For a responsive user experience, the character counter must process text in real time as users type:
class OptimizedCharacterCounter {
private worker!: Worker; // assigned in initializeWorker()
private debounceTimer: number | null = null;
private cache: Map<string, CharacterAnalysis> = new Map();
constructor() {
this.initializeWorker();
}
/**
* Debounced text analysis to prevent excessive computation
* Uses Web Workers for non-blocking processing
*/
analyzeTextAsync(text: string, callback: (result: CharacterAnalysis) => void): void {
// Clear previous debounce timer
if (this.debounceTimer) {
clearTimeout(this.debounceTimer);
}
// Check cache first
const cacheKey = this.hashText(text);
if (this.cache.has(cacheKey)) {
callback(this.cache.get(cacheKey)!);
return;
}
// Debounce the analysis; if the worker failed to start, fall back to
// synchronous processing on the main thread
this.debounceTimer = window.setTimeout(() => {
if (!this.worker) {
const result = new JapaneseCharacterCounter().analyzeText(text);
this.cache.set(cacheKey, result);
callback(result);
return;
}
// Set up a one-time listener for this analysis
const handleMessage = (event: MessageEvent) => {
if (event.data.cacheKey === cacheKey) {
this.cache.set(cacheKey, event.data.result);
this.cleanupCache();
callback(event.data.result);
this.worker.removeEventListener('message', handleMessage);
}
};
this.worker.addEventListener('message', handleMessage);
this.worker.postMessage({ text, cacheKey });
}, 150); // 150ms debounce delay
}
private initializeWorker(): void {
// Worker script would contain the JapaneseCharacterCounter class
this.worker = new Worker('/js/workers/character-counter-worker.js');
this.worker.onerror = (error) => {
console.error('Character counter worker error:', error);
// Fallback to main thread processing
this.worker = null as any;
};
}
private hashText(text: string): string {
// Simple hash function for caching
let hash = 0;
for (let i = 0; i < text.length; i++) {
const char = text.charCodeAt(i);
hash = ((hash << 5) - hash) + char;
hash = hash & hash; // Convert to 32-bit integer
}
return hash.toString(36);
}
/**
* Cache management to prevent memory bloat
*/
private cleanupCache(): void {
if (this.cache.size > 100) {
// Remove oldest 50% of entries
const entries = Array.from(this.cache.entries());
const toRemove = entries.slice(0, Math.floor(entries.length / 2));
toRemove.forEach(([key]) => this.cache.delete(key));
}
}
}
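The debounce logic above is a general pattern worth isolating. Here is a minimal sketch of the same trailing-edge debounce as a reusable helper (the class wires it inline instead):

```typescript
// Generic trailing-edge debounce: rapid calls collapse into one invocation
// after `delayMs` of silence — the pattern analyzeTextAsync uses inline.
function debounce<A extends unknown[]>(
  fn: (...args: A) => void,
  delayMs: number
): (...args: A) => void {
  let timer: ReturnType<typeof setTimeout> | null = null;
  return (...args: A) => {
    if (timer !== null) clearTimeout(timer); // restart the quiet period
    timer = setTimeout(() => fn(...args), delayMs);
  };
}

// Hypothetical usage: recount only after the user pauses typing for 150ms
// textarea.addEventListener('input', debounce(() => recount(textarea.value), 150));
```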
Incremental Analysis for Large Texts
For processing large documents efficiently:
class IncrementalAnalyzer {
private chunkSize = 1000; // Process 1000 characters at a time
private counter = new JapaneseCharacterCounter();
async analyzeLargeText(text: string): Promise<CharacterAnalysis> {
const chunks = this.splitIntoChunks(text);
const partialResults: CharacterAnalysis[] = [];
// Process chunks with yield for non-blocking execution
for (let i = 0; i < chunks.length; i++) {
const analysis = this.counter.analyzeText(chunks[i]);
partialResults.push(analysis);
// Yield control back to the event loop
if (i % 10 === 0) {
await this.yield();
}
}
return this.mergeResults(partialResults);
}
private splitIntoChunks(text: string): string[] {
const chunks: string[] = [];
const characters = Array.from(text); // Handle surrogate pairs
for (let i = 0; i < characters.length; i += this.chunkSize) {
chunks.push(characters.slice(i, i + this.chunkSize).join(''));
}
return chunks;
}
private async yield(): Promise<void> {
return new Promise(resolve => setTimeout(resolve, 0));
}
private mergeResults(results: CharacterAnalysis[]): CharacterAnalysis {
return results.reduce((merged, current, index) => ({
totalCharacters: merged.totalCharacters + current.totalCharacters,
hiragana: merged.hiragana + current.hiragana,
katakana: merged.katakana + current.katakana,
kanji: merged.kanji + current.kanji,
punctuation: merged.punctuation + current.punctuation,
numbers: merged.numbers + current.numbers,
latin: merged.latin + current.latin,
whitespace: merged.whitespace + current.whitespace,
// Every chunk reports at least one line, so subtract the overlap when merging
lineCount: merged.lineCount + current.lineCount - (index > 0 ? 1 : 0),
byteCount: {
utf8: merged.byteCount.utf8 + current.byteCount.utf8,
utf16: merged.byteCount.utf16 + current.byteCount.utf16,
sjis: merged.byteCount.sjis + current.byteCount.sjis
},
manuscriptPages: Math.ceil((merged.totalCharacters + current.totalCharacters) / 400)
}));
}
}
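The chunking step deserves emphasis: slicing the raw string at arbitrary UTF-16 indices could split a surrogate pair across two chunks and corrupt both. A standalone sketch of the code-point-aware split:

```typescript
// Code-point-aware chunking, as in splitIntoChunks above. text.slice() at a
// fixed index could cut a surrogate pair in half; Array.from cannot.
function splitIntoChunks(text: string, chunkSize: number): string[] {
  const chars = Array.from(text); // one entry per code point
  const chunks: string[] = [];
  for (let i = 0; i < chars.length; i += chunkSize) {
    chunks.push(chars.slice(i, i + chunkSize).join(''));
  }
  return chunks;
}

// '𠮷野家で𩸽' is 5 code points (two of them outside the BMP):
// splitIntoChunks('𠮷野家で𩸽', 2) → ['𠮷野', '家で', '𩸽']
```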
User Interface Implementation
Real-time Display Components
interface CounterDisplayProps {
analysis: CharacterAnalysis;
isProcessing: boolean;
}
class CharacterCounterDisplay {
private container: HTMLElement;
private counters: Map<string, HTMLElement> = new Map();
constructor(containerId: string) {
this.container = document.getElementById(containerId)!;
this.setupDisplay();
}
private setupDisplay(): void {
this.container.innerHTML = `
<div class="counter-grid">
<div class="counter-section primary">
<div class="counter-item total">
<span class="label">Total Characters</span>
<span class="count" data-counter="total">0</span>
</div>
</div>
<div class="counter-section japanese">
<h3>Japanese Characters</h3>
<div class="counter-item hiragana">
<span class="label">ひらがな (Hiragana)</span>
<span class="count" data-counter="hiragana">0</span>
</div>
<div class="counter-item katakana">
<span class="label">カタカナ (Katakana)</span>
<span class="count" data-counter="katakana">0</span>
</div>
<div class="counter-item kanji">
<span class="label">漢字 (Kanji)</span>
<span class="count" data-counter="kanji">0</span>
</div>
</div>
<div class="counter-section metrics">
<h3>Text Metrics</h3>
<div class="counter-item lines">
<span class="label">Lines</span>
<span class="count" data-counter="lines">0</span>
</div>
<div class="counter-item pages">
<span class="label">Manuscript Pages (400字)</span>
<span class="count" data-counter="pages">0</span>
</div>
</div>
<div class="counter-section encoding">
<h3>Byte Count</h3>
<div class="counter-item utf8">
<span class="label">UTF-8</span>
<span class="count" data-counter="utf8">0</span>
</div>
<div class="counter-item sjis">
<span class="label">Shift-JIS</span>
<span class="count" data-counter="sjis">0</span>
</div>
</div>
</div>
`;
// Cache counter elements
this.container.querySelectorAll('[data-counter]').forEach(el => {
const counter = el.getAttribute('data-counter')!;
this.counters.set(counter, el as HTMLElement);
});
}
/**
* Updates display with smooth animations
*/
updateDisplay(analysis: CharacterAnalysis): void {
const updates: Array<[string, number]> = [
['total', analysis.totalCharacters],
['hiragana', analysis.hiragana],
['katakana', analysis.katakana],
['kanji', analysis.kanji],
['lines', analysis.lineCount],
['pages', analysis.manuscriptPages],
['utf8', analysis.byteCount.utf8],
['sjis', analysis.byteCount.sjis]
];
updates.forEach(([counter, value]) => {
this.animateCounterUpdate(counter, value);
});
}
private animateCounterUpdate(counter: string, newValue: number): void {
const element = this.counters.get(counter);
if (!element) return;
// Strip the thousands separators added by toLocaleString() before parsing
const currentValue = parseInt((element.textContent || '0').replace(/,/g, ''), 10);
// Animate number change for visual feedback
this.animateNumber(element, currentValue, newValue, 200);
}
private animateNumber(
element: HTMLElement,
start: number,
end: number,
duration: number
): void {
const startTime = performance.now();
const updateNumber = (currentTime: number) => {
const elapsed = currentTime - startTime;
const progress = Math.min(elapsed / duration, 1);
// Easing function for smooth animation
const easeOutQuart = 1 - Math.pow(1 - progress, 4);
const current = Math.round(start + (end - start) * easeOutQuart);
element.textContent = current.toLocaleString();
if (progress < 1) {
requestAnimationFrame(updateNumber);
}
};
requestAnimationFrame(updateNumber);
}
}
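The easing curve is the heart of the animation: at the halfway point of the 200ms animation, the displayed number has already covered about 94% of the distance, which is what makes the update feel snappy. The curve is easy to verify in isolation:

```typescript
// easeOutQuart: fast start, gentle finish. Input is animation progress in
// [0, 1]; output is the fraction of the value change already displayed.
const easeOutQuart = (t: number): number => 1 - Math.pow(1 - t, 4);

easeOutQuart(0);   // 0      — animation start
easeOutQuart(0.5); // 0.9375 — ~94% of the distance at the halfway point
easeOutQuart(1);   // 1      — animation end
```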
Advanced Features and Considerations
Manuscript Paper Calculation
Traditional Japanese manuscript paper (原稿用紙, genkō yōshi) uses a 400-character format (20×20 grid). This calculation is crucial for academic and professional writing:
class ManuscriptCalculator {
static readonly STANDARD_PAGE_SIZE = 400; // 20x20 grid
static readonly CHARACTERS_PER_LINE = 20;
static readonly LINES_PER_PAGE = 20;
/**
* Calculates manuscript paper requirements
* Accounts for Japanese text formatting rules
*/
static calculateManuscriptMetrics(analysis: CharacterAnalysis): {
pages: number;
partialPage: number;
formattedLines: number;
recommendedSpacing: string;
} {
const totalChars = analysis.totalCharacters;
const pages = Math.floor(totalChars / this.STANDARD_PAGE_SIZE);
const remainder = totalChars % this.STANDARD_PAGE_SIZE;
return {
pages: pages + (remainder > 0 ? 1 : 0),
partialPage: remainder,
formattedLines: Math.ceil(totalChars / this.CHARACTERS_PER_LINE),
recommendedSpacing: this.getSpacingRecommendation(analysis)
};
}
private static getSpacingRecommendation(analysis: CharacterAnalysis): string {
if (analysis.totalCharacters === 0) return 'balanced'; // avoid division by zero
const density = analysis.kanji / analysis.totalCharacters;
if (density > 0.6) return 'dense-kanji';
if (density < 0.2) return 'kana-heavy';
return 'balanced';
}
}
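A quick worked example of the page math, assuming a hypothetical 2,350-character draft:

```typescript
// Manuscript math for a 2,350-character draft on standard 400-character
// (20×20) genkō yōshi, following the calculation above.
const PAGE_SIZE = 400;
const CHARS_PER_LINE = 20;
const totalChars = 2350;

const fullPages = Math.floor(totalChars / PAGE_SIZE); // 5 complete pages
const remainder = totalChars % PAGE_SIZE;             // 350 characters left over
const pages = fullPages + (remainder > 0 ? 1 : 0);    // 6 sheets needed
const lines = Math.ceil(totalChars / CHARS_PER_LINE); // 118 lines of 20
```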
Multi-Encoding Support
Different systems require different character encodings. Accurate byte counting helps with system compatibility:
class EncodingAnalyzer {
/**
* Provides accurate byte counts for legacy systems
* Particularly important for Shift-JIS compatibility
*/
static getDetailedEncodingInfo(text: string): {
utf8: { bytes: number; efficiency: number };
utf16: { bytes: number; efficiency: number };
shiftJIS: { bytes: number; compatibility: number };
} {
return {
utf8: {
bytes: new TextEncoder().encode(text).length,
efficiency: this.calculateEfficiency(text, 'utf8')
},
utf16: {
bytes: text.length * 2,
efficiency: this.calculateEfficiency(text, 'utf16')
},
shiftJIS: {
bytes: this.estimateSJISBytes(text),
compatibility: this.assessSJISCompatibility(text)
}
};
}
private static calculateEfficiency(text: string, encoding: string): number {
// Efficiency metric: characters per byte
const bytes = encoding === 'utf8'
? new TextEncoder().encode(text).length
: text.length * 2;
return Array.from(text).length / bytes;
}
private static assessSJISCompatibility(text: string): number {
// Returns percentage of characters that can be encoded in Shift-JIS
let compatible = 0;
const chars = Array.from(text);
for (const char of chars) {
if (this.isSJISCompatible(char)) compatible++;
}
return chars.length > 0 ? compatible / chars.length : 1;
}
private static isSJISCompatible(char: string): boolean {
const code = char.codePointAt(0)!;
// Simplified Shift-JIS (JIS X 0208) compatibility check
return (
code <= 0x7F || // ASCII
(code >= 0x3040 && code <= 0x30FF) || // Hiragana and katakana
(code >= 0xFF61 && code <= 0xFF9F) || // Half-width katakana and punctuation
(code >= 0x4E00 && code <= 0x9FAF) // Common kanji
);
}
}
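The efficiency figures make more sense with concrete UTF-8 widths in hand. The standard TextEncoder API (available in browsers and modern Node.js) shows how byte cost varies by script:

```typescript
// UTF-8 width by script: ASCII is 1 byte, kana and common kanji are 3 bytes
// (code points U+0800–U+FFFF), and Extension B kanji need 4 bytes.
const enc = new TextEncoder();

enc.encode('a').length;  // 1 byte  — ASCII
enc.encode('あ').length; // 3 bytes — hiragana (U+3042)
enc.encode('漢').length; // 3 bytes — common kanji (U+6F22)
enc.encode('𠮷').length; // 4 bytes — CJK Extension B (U+20BB7)
```

So for typical Japanese prose, UTF-8 runs at roughly 3 bytes per character versus 2 for UTF-16 and Shift-JIS, which is why the efficiency metric above is worth surfacing.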
Related Resources and Further Reading
Character Counter Implementations
- Google's Diff-Match-Patch: Advanced text processing algorithms
- Japanese Character Counter Tools: Real-world implementation examples
- Unicode Database: Official Unicode character data
Japanese Text Processing Libraries
- Kuroshiro: Japanese language utility library
- WanaKana: Japanese text transformation library
- TinySegmenter: Japanese text segmentation
Unicode and Character Encoding
- Stack Overflow Discussion: Unicode ranges for Japanese characters
- Unicode Technical Reports: East Asian character handling standards
- Mozilla Developer Network: JavaScript Unicode handling
Performance Optimization
- Web Workers Best Practices: MDN Web Workers Guide
- Text Processing Algorithms: Efficient string processing techniques
Similar Tools and Inspirations
- TextCounter-JP: Japanese-optimized character counter
Technical Challenges and Solutions
Surrogate Pair Handling
Modern JavaScript requires careful handling of Unicode surrogate pairs for characters outside the Basic Multilingual Plane:
// Incorrect: May split surrogate pairs
const incorrectLength = text.length;
// Correct: Handles surrogate pairs properly
const correctLength = Array.from(text).length;
// Alternative using spread operator
const alternativeLength = [...text].length;
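The difference becomes concrete with any character outside the Basic Multilingual Plane, for example in 𠮷野家, a rendering of the Yoshinoya shop name that uses the U+20BB7 variant kanji:

```typescript
// '𠮷野家' contains one character stored as a surrogate pair, so the two
// length measures disagree.
const shopName = '𠮷野家';

shopName.length;             // 4 — UTF-16 code units ('𠮷' counts twice)
Array.from(shopName).length; // 3 — user-perceived characters (code points)
```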
Performance Considerations
Real-time processing requires a careful balance between performance and feature depth.
Key optimization strategies include:
- Debounced Input Processing: Prevents excessive computation during rapid typing
- Character Range Optimization: Efficient Unicode range checking using binary search
- Incremental Analysis: Breaking large texts into manageable chunks
- Web Worker Utilization: Offloading computation from the main thread
Conclusion
Kantan Tools' character counter demonstrates sophisticated understanding of Japanese text processing requirements. By implementing Unicode-aware character classification, real-time performance optimization, and practical features like manuscript paper calculation, it addresses the unique challenges of Japanese text analysis.
Building TextCounter-JP taught me that great ideas often come from improving existing solutions rather than starting from scratch. While Kantan Tools provided the initial inspiration, focusing on the specific needs of Japanese text processing allowed me to create something truly specialized.
The technical implementation showcases modern web development best practices: progressive enhancement, accessibility-conscious design, and performance-optimized algorithms. For developers working with multilingual text processing, especially CJK languages, these techniques provide a solid foundation for building robust, user-friendly tools.
For more technical discussions on Japanese text processing and character encoding, explore the linked resources above or contribute to the ongoing development of multilingual web tools.