Full-stack AI-powered language learning application with custom ML models for translation and real-time tone correction.

DuShuo: Chinese learning from the content I actually consume

What I’ve built (so far)

Chrome extension: pinyin overlay + hover definitions + one-click vocab saves + video subtitle layer (pinyin/English/Chinese overlays for platforms like Netflix/Bilibili/Disney+).

Dictionary foundation: merged and normalized ~123k entries; added context, examples, and metadata via an ETL pipeline that calls LLM APIs. Built multiple evals for quality and cost optimization.

Generative quizzes: established a question-generation prompt and schema for 21 question types (multiple choice, open response, cloze fill, grammar, etc.), with grading and feedback.

Tone feedback: implemented a speech-to-text module that provides real-time, karaoke-like tone feedback. After testing a mix of methods, I settled on MFCC features with normalized audio input for their simplicity and accuracy.

My role: end-to-end product + engineering (research, UX, backend, frontend, model routing, evaluation + cost).

DuShuo: Extension-first

I built DuShuo because the content I was using to learn Chinese was not the content I actually enjoyed consuming. When I did find a show or article that I liked, I’d get frustrated by how long it took to look up a word I didn’t know. I often needed several sources to truly understand a word’s meaning and nuances, which broke my flow as I bounced between tabs and applications (dictionary → notes → flashcards → back to content). The friction was small, but it happened often enough that I decided I needed a tool to help me learn Chinese.

While searching for such a tool, I found that most of the options were either too simple or too expensive for their limited utility. So I came up with the idea for DuShuo (读说). It started as a Chrome extension that turns any Chinese text on a webpage into an engaging learning tool: users can hover over a word for its meaning, display pinyin, and save vocabulary in one click. In follow-up iterations, users could attach the same layer to streaming subtitles through a mix of subtitle extraction and ASR-powered speech-to-text. This required a bit of magic to get right, but buffering the audio stream and aligning it with the subtitle stream was a fun challenge. Thanks to open-source models like SenseVoice Small, I was able to run this on my own hardware and build in model optimizations to make it fast and reliable.

Part II: Building the Data Foundation

Building the dictionary: LLM Pipelines & Data Enhancement

As I started to collect vocabulary terms and organize them into collections, I realized that I needed a proper dictionary to tie into the extension. Sources like CC-CEDICT, Moedict, Baidu Baike, and Unihan provided a solid foundation, but each lacked the pedagogical metadata needed for a true learning platform: modern usage, nuance (“formal”, “internet slang”, “polite but stiff”), senses, parts of speech, patterns, and related terms.

So I built an ETL pipeline that merges multiple sources and then uses LLMs to enrich each entry into a unified, learner-oriented schema.
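
The merge step can be sketched roughly as follows. This is a minimal illustration, not the actual pipeline: the record shape, the `merge_sources` helper, the generic source names, and the choice to key entries on (headword, pinyin) are all assumptions made for the example.

```python
from collections import defaultdict

def merge_sources(*sources):
    """Merge per-source entry lists into one record per (headword, pinyin).

    Each entry is assumed (hypothetically) to look like:
    {"word": "考虑", "pinyin": "kao3 lu4", "definitions": [...], "source": "source_a"}
    """
    merged = defaultdict(lambda: {"definitions": [], "sources": []})
    for entries in sources:
        for entry in entries:
            # Normalize the key so casing/whitespace differences do not split entries.
            key = (entry["word"], entry["pinyin"].strip().lower())
            record = merged[key]
            record["word"], record["pinyin"] = key
            for definition in entry.get("definitions", []):
                normalized = definition.strip()
                # Keep first-seen order while dropping exact duplicates.
                if normalized and normalized not in record["definitions"]:
                    record["definitions"].append(normalized)
            if entry["source"] not in record["sources"]:
                record["sources"].append(entry["source"])
    return list(merged.values())

source_a = [{"word": "考虑", "pinyin": "kao3 lu4",
             "definitions": ["to consider"], "source": "source_a"}]
source_b = [{"word": "考虑", "pinyin": "Kao3 lu4",
             "definitions": ["to consider", "to think over"], "source": "source_b"}]
merged = merge_sources(source_a, source_b)
```

In this sketch the two records collapse into one entry with deduplicated definitions, which is the shape an LLM enrichment step could then extend with usage notes and metadata.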

Final Schema Priorities

The resulting dictionary structure was optimized for:

Decomposition — Radical and component breakdown for character learning
Context — Encyclopedic descriptions from Baike enrichment
Parts of Speech — Grammatical categorization for sentence construction
Social Nuance — Register, formality, and usage context
Semantic Relations — Related words and synonyms for vocabulary expansion
Vector Embeddings — BGE-M3 embeddings for hybrid search (dense + sparse)
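
One way dense and sparse signals can be combined at query time is a weighted sum. The sketch below is plain Python over precomputed vectors and does not call any embedding library; the `hybrid_score` helper, the 0.7/0.3 weights, and the token-to-weight dict format are assumptions for illustration, not the scoring DuShuo necessarily uses.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def sparse_overlap(query_weights, doc_weights):
    """Lexical score: sum of weight products over tokens present in both."""
    return sum(w * doc_weights[tok]
               for tok, w in query_weights.items() if tok in doc_weights)

def hybrid_score(q_dense, d_dense, q_sparse, d_sparse, w_dense=0.7, w_sparse=0.3):
    """Weighted sum of dense and sparse relevance (weights are illustrative)."""
    return (w_dense * cosine(q_dense, d_dense)
            + w_sparse * sparse_overlap(q_sparse, d_sparse))

# Toy inputs: identical dense vectors, one shared sparse token.
score = hybrid_score([1.0, 0.0], [1.0, 0.0],
                     {"考虑": 0.5}, {"考虑": 0.4, "计划": 0.2})
```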

Part III: Generative quizzes: my main goal was reliability, not creativity

I wanted infinite practice items without shipping a brittle hand-authored question bank. But letting an LLM free-write questions would lead to unparseable outputs that aren’t compatible with a non-chat UI. So I set out to create a system where quizzes are generated via tool calling with strict schemas. The model can be creative inside constraints, but the output always matches what the frontend expects. In my testing, question generation stayed within roughly 1–2 seconds, and the system produced a diverse set of questions that were unique yet relevant to the user’s learning experience.

The point here isn’t that we should use LLMs to do everything. It’s that useful AI systems are personal and efficient, and successful implementations need a contract (explicit context + schema + validation).

I chose to use LLMs here because standard spaced repetition systems (SRS) like Anki or Quizlet fail to capture the nuance required for true language mastery. A simple “Did you know this?” binary doesn’t distinguish between:

Recognizing a word in multiple choice
Producing the word in a sentence
Using proper tone in speech
Selecting the socially appropriate synonym

By leveraging LLM tool calling, the system forces the model to generate quiz questions that adhere to strict JSON schemas while keeping enough creative freedom to produce endless variations:

{
  "question_type": "fill_in_blank",
  "target_word": "考虑",
  "context_sentence": "我们需要____一下这个计划的可行性。",
  "options": ["考虑", "考试", "考察", "考验"],
  "difficulty": "intermediate",
  "bloom_level": "application"
}
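
The "validation" part of that contract could look something like the sketch below. It is a hand-rolled check using only the standard library; the `validate_question` helper, the `ALLOWED_TYPES` subset, and the specific rules (target word must be among the options, cloze sentences need a `____` placeholder) are assumptions inferred from the example payload above, not the project's actual validator.

```python
import json

ALLOWED_TYPES = {"fill_in_blank", "multiple_choice"}  # illustrative subset only
REQUIRED_FIELDS = {
    "question_type": str,
    "target_word": str,
    "context_sentence": str,
    "options": list,
    "difficulty": str,
    "bloom_level": str,
}

def validate_question(raw):
    """Parse a tool-call payload and return (question, errors)."""
    try:
        question = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(question, dict):
        return None, ["payload must be a JSON object"]
    errors = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in question:
            errors.append(f"missing field: {field}")
        elif not isinstance(question[field], expected):
            errors.append(f"wrong type for field: {field}")
    if errors:
        return None, errors  # Skip semantic checks until the shape is right.
    if question["question_type"] not in ALLOWED_TYPES:
        errors.append("unknown question_type")
    if question["target_word"] not in question["options"]:
        errors.append("target_word must appear in options")
    if (question["question_type"] == "fill_in_blank"
            and "____" not in question["context_sentence"]):
        errors.append("fill_in_blank needs a ____ placeholder")
    return (question, []) if not errors else (None, errors)
```

A rejected payload can be retried or dropped before it ever reaches the frontend, which is what keeps a non-chat UI from having to handle free-form text.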

Question Types Generated

Vocabulary
  Multiple Choice
    Word to Meaning (“What does 你好 mean?”)
    Meaning to Word (“Which word means hello?”)
    Choose correct among visually similar (己 vs 已 vs 巳)
    Identify within selected collection
    Identify pinyin tone vs noisy alternatives
    Audio recognition → match to characters
  Open Response
    Define in English
    Use in a sentence
    Audio recognition → type characters
    Type the definition
    Cloze Fill (select words from bank)
Grammar
  Multiple Choice
    Match grammar usage (CN→EN)
    Match grammar usage (EN→CN)
    Identify incorrect usage
    Distinguish similar structures (best/any applies)
  Open Response
    Explain (English)
    Use in a sentence
Reading
  Multiple Choice
    Reading comprehension (difficulty per form)
  Open Response
    Reading comprehension (Open)
    Translate the passage
    Typing practice (test for speed and accuracy)

Integrated Grading & Feedback

The LLM doesn’t just generate questions — it also provides tailored feedback by analyzing:

The user’s answer against expected patterns
Socially uncommon word usage
The user’s active vocabulary collection for context-aware hints

The Mathematics of Memory: BKT + Elo Learner Model

The Depth vs. Frequency Framework: Optimizing the Learning Curve

Standard SRS tools have a critical flaw: they treat all successful recalls equally. But cognitively, correctly selecting “狗” from a multiple-choice list is fundamentally different from correctly using “狗” in an original sentence.

Bayesian Knowledge Tracing (BKT) Implementation

The learner model uses BKT to estimate mastery probability based on performance history:

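# BKT parameters
BKT_P_INIT = 0.3    # Initial knowledge probability
BKT_P_LEARN = 0.3   # Learning rate
BKT_P_GUESS = 0.25  # Guess probability
BKT_P_SLIP = 0.1    # Slip probability

def compute_bkt_mastery(correct_count, total_count):
    """Estimate mastery in [0, 1] from aggregate counts (a BKT-inspired heuristic)."""
    if total_count <= 0:
        return BKT_P_INIT  # No evidence yet: fall back to the prior
    correct_rate = correct_count / total_count
    p_known = BKT_P_INIT  # Assumed default when accuracy is at or below the guess rate
    if correct_rate > BKT_P_GUESS:
        # Rescale accuracy so guess-level maps to 0 and (1 - slip) maps to 1
        evidence = (correct_rate - BKT_P_GUESS) / (1.0 - BKT_P_GUESS - BKT_P_SLIP)
        p_known = BKT_P_INIT + (1 - BKT_P_INIT) * evidence * BKT_P_LEARN * total_count
    return max(0.0, min(1.0, p_known))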
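
For comparison, the textbook form of BKT updates the mastery estimate one response at a time with Bayes' rule instead of working from aggregate counts. The sketch below shows that standard update using the same parameter values as above; the `bkt_update` name and the sample response sequence are hypothetical, and this is offered as a reference point rather than as DuShuo's implementation.

```python
def bkt_update(p_known, correct, p_learn=0.3, p_guess=0.25, p_slip=0.1):
    """One step of the classical BKT update for a single observed response."""
    if correct:
        # P(known | correct): knew it and did not slip, versus a lucky guess.
        posterior = p_known * (1 - p_slip) / (
            p_known * (1 - p_slip) + (1 - p_known) * p_guess
        )
    else:
        # P(known | wrong): knew it but slipped, versus not knowing and not guessing.
        posterior = p_known * p_slip / (
            p_known * p_slip + (1 - p_known) * (1 - p_guess)
        )
    # Account for the chance of learning during this practice opportunity.
    return posterior + (1 - posterior) * p_learn

p = 0.3  # Same prior as BKT_P_INIT
for outcome in [True, True, False, True]:
    p = bkt_update(p, outcome)
```

Because each response is folded in sequentially, recent answers weigh more heavily than in a count-based estimate, at the cost of needing the ordered response history.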

Elo Rating for Item Difficulty

The BKT mastery estimate is combined with Elo ratings, which match learner ability to item difficulty:

Expected Score = 1 / (1 + 10^((opponent_rating - current_rating) / 400))
New Rating = current_rating + K * (actual_score - expected_score)
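
In code, the two formulas above might look like the following sketch. Treating the quiz item as the "opponent", updating the item rating symmetrically, and the K-factor of 32 are assumptions for illustration; the source does not state which K value or item-update rule is used.

```python
def expected_score(learner_rating, item_rating):
    """Probability the learner answers the item correctly under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((item_rating - learner_rating) / 400))

def update_ratings(learner_rating, item_rating, correct, k=32):
    """Return updated (learner, item) ratings after one response."""
    expected = expected_score(learner_rating, item_rating)
    actual = 1.0 if correct else 0.0
    delta = k * (actual - expected)
    # The item "loses" when the learner answers correctly, so it moves the other way.
    return learner_rating + delta, item_rating - delta

# Evenly matched learner and item: expected score is 0.5.
learner, item = update_ratings(1200, 1200, correct=True)
# learner == 1216.0, item == 1184.0
```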

Target Selection Categories

Category	Mastery Range	Ratio	Purpose
Review	BKT < 0.4	30%	Reinforce struggling skills
Practice	BKT 0.4-0.7	50%	Consolidate learning
Challenge	BKT > 0.7	20%	Push into i+1 territory
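
A selection routine based on this table could be sketched as below. The thresholds and ratios come from the table; the `select_targets` helper, the rounding of quotas, the handling of the 0.4 and 0.7 boundaries, and the decision not to backfill a short category are assumptions made for the example.

```python
import random

CATEGORY_RATIOS = {"review": 0.3, "practice": 0.5, "challenge": 0.2}

def categorize(mastery):
    """Map a BKT mastery estimate to a selection category."""
    if mastery < 0.4:
        return "review"
    if mastery <= 0.7:
        return "practice"
    return "challenge"

def select_targets(mastery_by_item, session_size, rng=None):
    """Pick items per category in roughly the 30/50/20 ratio."""
    rng = rng or random.Random()
    buckets = {name: [] for name in CATEGORY_RATIOS}
    for item, mastery in mastery_by_item.items():
        buckets[categorize(mastery)].append(item)
    selected = []
    for name, ratio in CATEGORY_RATIOS.items():
        quota = round(session_size * ratio)
        pool = buckets[name]
        # If a category has too few items, take what exists (no backfill here).
        selected.extend(rng.sample(pool, min(quota, len(pool))))
    return selected

mastery = {f"w{i}": i / 20 for i in range(20)}  # toy mastery values 0.00..0.95
session = select_targets(mastery, session_size=10, rng=random.Random(0))
```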

6-Dimensional Skill Embedding System

Each vocabulary item tracks mastery across six cognitive dimensions:

Dimension	Type	Cognitive Focus
emb_pinyin_recall	Recognition	Can identify correct pinyin
emb_meaning_recall	Recognition	Can identify correct meaning
emb_pronunciation_recall	Recognition	Can identify correct pronunciation
emb_usage_reproduction	Production	Can use word in context
emb_meaning_reproduction	Production	Can produce translation
emb_pronunciation_reproduction	Production	Can speak correctly

This multi-dimensional approach allows the system to identify specific weaknesses rather than treating vocabulary as monolithic units.
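
As a rough sketch of how the six dimensions can surface a specific weakness, the record below stores one score per dimension and reports the lowest. The field names come from the table above; the `SkillVector` class, the `weakest` method, the 0 to 1 score range, and the sample values are hypothetical.

```python
from dataclasses import dataclass, fields

@dataclass
class SkillVector:
    """Per-word mastery estimates in [0, 1], one per cognitive dimension."""
    emb_pinyin_recall: float = 0.0
    emb_meaning_recall: float = 0.0
    emb_pronunciation_recall: float = 0.0
    emb_usage_reproduction: float = 0.0
    emb_meaning_reproduction: float = 0.0
    emb_pronunciation_reproduction: float = 0.0

    def weakest(self):
        """Return the (dimension, score) pair with the lowest mastery."""
        scores = {f.name: getattr(self, f.name) for f in fields(self)}
        name = min(scores, key=scores.get)
        return name, scores[name]

# A learner who recognizes 狗 well but struggles to use it in a sentence.
gou = SkillVector(0.9, 0.85, 0.7, 0.3, 0.6, 0.4)
weak_dimension, weak_score = gou.weakest()
```

Here the weakest dimension is `emb_usage_reproduction`, so the next quiz for this word could favor a "Use in a sentence" question over another multiple-choice item.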

Decision History: Why BKT + Elo Over Alternatives?

Alternative Considered	Why Rejected
Simple SM2	No probabilistic mastery estimation; binary pass/fail
Half-Life Regression (HLR)	Requires more historical data; cold-start problem
Deep Knowledge Tracing (DKT)	Neural network overhead; interpretability issues
Pure Elo	Designed for competition, not learning progression

Why BKT + Elo Hybrid?

BKT for mastery — Probabilistic model works well with sparse data
Elo for difficulty matching — Matches learner to appropriate challenge level
Interpretable — Can explain “you have 65% mastery of this word”
Computationally light — No GPU required; runs on every request