From Image to Music Language: Starry's Two-Stage Approach for Complex Polyphonic OMR
Paper: arXiv:2604.20522
GitHub: FindLab-org/starry
Live demo: ✨Starry space on Hugging Face
Abstract
We propose a new approach for a practical two-stage Optical Music Recognition (OMR) pipeline, with a particular focus on its second stage. Given symbol and event candidates from the visual pipeline, we decode them into an editable, verifiable, and exportable score structure. We focus on complex polyphonic staff notation, especially piano scores, where voice separation and intra-measure timing are the main bottlenecks. Our approach formulates second-stage decoding as a structure decoding problem and uses topology recognition with probability-guided search (BeadSolver) as its core method. We also describe a data strategy that combines procedural generation with recognition-feedback annotations. The result is a practical decoding component for real OMR systems and a path to accumulate structured score data for future end-to-end, multimodal, and RL-style methods.
What problem we focus on
In complex piano notation, the main bottleneck is often not detecting individual noteheads, stems, or rests, but deciding how locally plausible events should be assembled into voices and timing. Several voices may overlap at nearly the same horizontal position, partial voices may appear only locally, and rhythmic logic can depend on tuplets, grace notes, or whole-measure rest conventions.
Multi-voice overlap. Local geometry alone is not enough to decide voice continuation: events that appear close together horizontally may belong to different voices, yet a usable score requires globally coherent voice chains and time positions.
This is why Starry treats complex OMR as a structured decoding problem, not only as an image recognition problem.
A two-stage view of OMR
Starry treats OMR as a staged transformation from images to editable music structure. The visual system operates at per-page and per-staff levels to produce event candidates, and regulation resolves measure-level topology into a serialized representation that can be exported to standard music-language formats such as MusicXML and LilyPond.
Starry overview. A compact overview of the full pipeline, from images through visual candidate generation and regulation to downstream music-language formats.
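As a rough sketch, the staged transformation can be stubbed out as plain functions. Every name below is an illustrative placeholder, not from the Starry codebase; only the stage boundaries mirror the pipeline described above.

```python
# A schematic of the staged transformation, with each stage stubbed as
# a plain function. All names are illustrative placeholders.

def visual_stage(page):
    """Stage 1: per-page/per-staff recognition -> event candidates."""
    # Real system: layout analysis, heatmaps, masks, candidate assembly.
    return [{"measure": 0, "x": 1.0, "pitch": 60}]

def regulation_stage(candidates):
    """Stage 2: resolve measure-level topology into a score structure."""
    # Real system: BeadSolver topology search over voice chains.
    by_measure = {}
    for cand in candidates:
        by_measure.setdefault(cand["measure"], []).append(cand)
    return by_measure

def export_score(score):
    """Serialize the structure to a toy music-language string."""
    return " | ".join(
        " ".join(str(e["pitch"]) for e in events)
        for _, events in sorted(score.items())
    )

score = regulation_stage(visual_stage("page.png"))
```

The point of the split is that everything before `regulation_stage` is local and per-staff, while everything after it is a structured, exportable object.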
Visual evidence before structural commitment
The visual side of the system is designed to produce robust local evidence before any global voice-structure decision is made. The pipeline operates at page and staff levels, including layout analysis, staff straightening, semantic heatmaps, foreground masks, bracket recognition, OCR-related text tokens, and candidate assembly.
Visual pipeline. The system first produces local recognition evidence and event candidates, then delays global structural commitment until the regulation stage.
A few examples of the visual predictors are shown below.
These predictors do not need to solve the whole score by themselves. Their job is to create useful candidates and local hints for the regulation stage.
Assembly. Local semantic detections are grouped into measure-level event candidates, with attributes derived from geometric measurements and confidence cues from recognition.
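A minimal sketch of the grouping step: detections that share almost the same horizontal position are merged into one vertical event candidate. The field names and the tolerance are illustrative, not Starry's actual representation.

```python
# Toy candidate assembly: local detections (dicts with 'x', 'pitch',
# 'conf') closer than `x_tolerance` staff spaces on the x axis are
# assumed to belong to the same vertical event (e.g. a chord).

def assemble_events(detections, x_tolerance=0.5):
    events = []
    for det in sorted(detections, key=lambda d: d["x"]):
        if events and det["x"] - events[-1]["x"] <= x_tolerance:
            group = events[-1]
            group["notes"].append(det)
            # Track the mean x; keep the weakest confidence as a caution cue.
            group["x"] = sum(n["x"] for n in group["notes"]) / len(group["notes"])
            group["conf"] = min(group["conf"], det["conf"])
        else:
            events.append({"x": det["x"], "notes": [det], "conf": det["conf"]})
    return events
```

The confidence cue carried on each event is exactly the kind of local hint the regulation stage can exploit later.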
Regulation as topology decoding
The core regulation method in Starry is BeadSolver. It treats measure-level decoding as a topology problem. Instead of asking a model to predict a complete polyphonic score structure in one shot, BeadSolver uses probability-guided tree search to explore candidate voice-chain assignments among event candidates.
Raw ambiguous measure before regulation.
Regulated voice structure after topology decoding.
The key idea is to separate local evidence from structural commitment. A learned model estimates which continuation is plausible at each step, while the solver keeps multiple possibilities alive long enough to compare their consequences at the measure level.
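This division of labor can be sketched as a small beam search: a pluggable `step_prob` stands in for the learned model's local guidance, while a measure-level `is_complete` check plays the role of comparing consequences at the end. Everything here is a toy illustration, not the BeadSolver implementation.

```python
import heapq
import math

# Toy probability-guided search: keep the `beam_width` most probable
# partial orderings alive, then filter by a measure-level check.
# Candidates are assumed distinct.

def solve_measure(candidates, step_prob, is_complete, beam_width=4):
    """step_prob(prefix, nxt) -> probability of `nxt` following `prefix`;
    is_complete(sequence)  -> True if the full sequence is a coherent
                              measure-level interpretation."""
    beams = [(0.0, [])]  # (negative log-prob, committed prefix)
    for _ in range(len(candidates)):
        expanded = []
        for cost, prefix in beams:
            for nxt in candidates:
                if nxt in prefix:
                    continue
                p = step_prob(prefix, nxt)
                if p > 0:
                    expanded.append((cost - math.log(p), prefix + [nxt]))
        beams = heapq.nsmallest(beam_width, expanded)
    # Keep only hypotheses that survive the measure-level check.
    finished = [(c, seq) for c, seq in beams if is_complete(seq)]
    return min(finished, default=None)
```

Note that a locally greedy choice can be overturned: a hypothesis that looks slightly worse mid-measure survives in the beam and wins if it is the only one that passes the completeness check.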
Voice chains, timing, and measure consistency
In a topology view, events are linked into ordered voice-wise chains. The main question is not only which symbols were detected, but how they continue across a measure and how different voices interleave.
Voice chains as topology. Events are linked into ordered voice-wise chains rather than interpreted independently.
Once a candidate chain structure is fixed, the solver can derive ticks, durations, and voice-wise timelines. It can also reject solutions that do not yield a coherent measure-level interpretation.
From topology to timing. A topology candidate becomes useful only if it can produce a coherent timing structure.
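The consistency requirement can be made concrete: each voice's durations must tile the same measure length exactly, or the candidate topology is rejected. A minimal sketch, using an illustrative representation with exact fractions:

```python
from fractions import Fraction

# Toy timing derivation: once voice chains are fixed, map each voice
# (a list of durations) to onset ticks, and reject any assignment
# whose voices do not fill the measure exactly.

def derive_ticks(voice_durations, measure_length=Fraction(1)):
    timelines = []
    for durations in voice_durations:
        ticks, t = [], Fraction(0)
        for d in durations:
            ticks.append(t)
            t += d
        if t != measure_length:
            return None  # incoherent: voice over- or under-fills the measure
        timelines.append(ticks)
    return timelines
```

Exact rational arithmetic matters here: tuplet durations like 1/3 of a beat do not survive floating-point summation, but they tile a measure exactly as `Fraction`s.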
BeadSolver as a decision process
One useful intuition is to formulate multi-voice structure decoding as a Markov decision process. The solver incrementally chooses continuations, while the state records the committed prefix and the remaining candidates.
In this view, a barline can be imagined as a kind of space-time portal. The decoder may finish one voice, jump through the barline, and continue from the beginning of another voice, while still building one coherent measure-level topology.
Barlines as portals. The path shows how decoding can move across voices while still building one coherent measure-level topology.
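The state/action view can be sketched as follows, with the barline jump made explicit as a voice-closing action. All names here are hypothetical, chosen only to illustrate the decision-process framing.

```python
from dataclasses import dataclass

# Toy decoding state: the committed prefix plus the remaining event
# candidates. An action either appends a candidate to the current
# voice or "jumps through the barline" ('|') to start the next voice.

@dataclass(frozen=True)
class DecodeState:
    prefix: tuple = ()            # committed events, '|' marks voice ends
    remaining: frozenset = frozenset()

    def step(self, action):
        """Apply an action: an event id, or '|' to close the voice."""
        if action == "|":
            return DecodeState(self.prefix + ("|",), self.remaining)
        if action not in self.remaining:
            raise ValueError("action must be a remaining candidate or '|'")
        return DecodeState(self.prefix + (action,),
                           self.remaining - {action})

    @property
    def terminal(self):
        return not self.remaining
```

Making the state immutable (`frozen=True`) is a natural fit for tree search: branching hypotheses can share prefixes without copying or mutation hazards.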
BeadPicker: the learned model inside the solver
BeadPicker is the learned model used by BeadSolver. At each step, it reads the current measure candidates together with the committed prefix, then estimates a probability distribution over which event should come next, including voice-closing boundary markers. It also provides duration priors and tick estimates used by the evaluator.
BeadPicker architecture. The model reads measure-level candidates, geometry, local hints, and prefix context, then produces successor probabilities and related fields used by the solver.
This combination is important: the model provides local probabilistic guidance, and the solver enforces global structure by searching and evaluating measure-level consistency.
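That interface can be sketched as a function from (prefix, candidates) to a successor distribution that includes a voice-closing marker. The hand-written heuristic below is a placeholder for the learned model; names and scoring are illustrative only.

```python
import math

# Toy stand-in for the picker interface: score each open candidate
# (plus a voice-closing marker) and normalize with a softmax.

VOICE_CLOSE = "</voice>"

def successor_distribution(prefix, candidates):
    """Return P(next event | prefix) over candidate ids and VOICE_CLOSE."""
    scores = {}
    last_x = prefix[-1]["x"] if prefix else 0.0
    for cand in candidates:
        # Placeholder heuristic: prefer small horizontal jumps.
        scores[cand["id"]] = -abs(cand["x"] - last_x)
    scores[VOICE_CLOSE] = -1.0  # fixed bias toward continuing the voice
    z = sum(math.exp(s) for s in scores.values())
    return {k: math.exp(s) / z for k, s in scores.items()}
```

Because the output is a proper distribution over successors, any search strategy (greedy, beam, or full tree search) can consume it unchanged; only the solver decides how many hypotheses to keep alive.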
Try it yourself
You can try Starry directly in the live demo:
Open the Starry live demo on Hugging Face
The demo accepts score images or PDFs, runs Starry recognition, lets you inspect and edit the recognized result, and exports structured notation to formats such as MusicXML.
Related work and repositories
- Lotus: an SVG/LilyPond geometry pipeline used to recover per-glyph positions from engraved scores, giving generated topology samples realistic spatial layouts.
- Paraff: a compact symbolic-music DSL used in the data-generation strategy; this project is now archived, with Lilylet as its successor.
- Lilylet: a symbolic-music language designed as a LilyPond variant. Starry OMR supports Lilylet as one of its export formats for structured recognition results.
- IMSLP-Mining: a related data-mining project for converting open sheet-music images into usable symbolic datasets.