API Reference

Read CHAT data.

Parameters:

path – Path to a .zip file, a local directory containing .cha files, a single .cha file, a git repository URL (ending in .git), or an HTTP/HTTPS URL.
filter_files – Filename(s) to keep. Regular expression matching is supported. If None, all files are included.
filter_participants – Participant code(s) to keep. Regular expression matching is supported. If None, all participants are included.
cls – The class used to create the reader. Must be CHAT or a subclass of it.
strict – If True, enforce strict parsing of the CHAT data.

Returns:

A CHAT instance filtered by the specified files and participants.

Raises:

TypeError – If cls is not CHAT or a subclass of it.
ValueError – If path does not point to a recognized source.

class pylangacq.Age¶

Age in the CHAT format: years;months.days.

years: int¶: Number of years.

months: int | None¶: Number of months.

days: int | None¶: Number of days.

in_months() → float¶: Return the age in total months as a float.
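
As a rough illustration of what a total-months conversion can look like, the sketch below combines the three components of a years;months.days age. The helper name age_in_months and the 30-day month used for the days component are assumptions made for this example, not the library's documented behavior.

```python
from typing import Optional


def age_in_months(years: int, months: Optional[int] = None, days: Optional[int] = None) -> float:
    """Hypothetical helper: convert a years;months.days age to total months.

    Assumption: a missing component counts as 0 and one month has 30 days.
    """
    return years * 12 + (months or 0) + (days or 0) / 30


# An age written as 2;03.15 in CHAT.
print(age_in_months(2, 3, 15))
```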

class pylangacq.CHAT¶

CHAT data reader for CHILDES/TalkBank transcripts.

This class parses CHAT transcription files and provides access to utterances, tokens, words, and annotations.

ages() → list[Age | None]¶

Return the age of the target child (CHI) in each file.

Returns:: One entry per file: an Age, or None if the file has no CHI or the CHI has no age.

append() → None¶

Append data from another CHAT reader.

Parameters:: other – A CHAT reader whose data to append.

append_left() → None¶

Left-append data from another CHAT reader.

Parameters:: other – A CHAT reader whose data to prepend.

clear() → None¶: Remove all data from this reader.

extend() → None¶

Extend data from multiple CHAT readers.

Parameters:: others – CHAT readers whose data to append.

extend_left() → None¶

Left-extend data from multiple CHAT readers.

Parameters:: others – CHAT readers whose data to prepend.

file_paths¶

Return the list of file paths.

Returns:: File paths or identifiers.

filter(*, files: str | Sequence[str] | None = None, participants: str | Sequence[str] | None = None) → CHAT¶

Return a new CHAT filtered by file path and/or participant regex.

Parameters:

files – Regex pattern(s) to include only matching file paths. Accepts a single string or a sequence of strings. Multiple patterns are OR’d.
participants – Regex pattern(s) to include only matching participant codes. Accepts a single string or a sequence of strings. Patterns are auto-anchored (full match). Multiple patterns are OR’d.

Returns:

A new filtered CHAT reader.
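
The matching rules described above can be sketched with the standard re module. This is only an illustration: the helper matches_any is hypothetical, and treating file patterns as unanchored searches (re.search) is an assumption, since only participant patterns are stated to be auto-anchored.

```python
import re
from typing import Optional, Sequence, Union

Patterns = Optional[Union[str, Sequence[str]]]


def matches_any(patterns: Patterns, value: str, *, anchored: bool) -> bool:
    """Hypothetical helper: True if any pattern matches (patterns are OR'd)."""
    if patterns is None:
        return True  # no filter: keep everything
    if isinstance(patterns, str):
        patterns = [patterns]
    # Full match for participant codes; unanchored search assumed for file paths.
    matcher = re.fullmatch if anchored else re.search
    return any(matcher(p, value) is not None for p in patterns)


codes = ["CHI", "MOT", "FAT", "CHI2"]
kept = [c for c in codes if matches_any(["CHI", "M.*"], c, anchored=True)]
print(kept)  # "CHI2" is dropped because "CHI" must match the whole code
```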

classmethod from_dir(path: str | os.PathLike[str], *, match: str | None = None, extension: str = '.cha', parallel: bool = True, strict: bool = True, mor_tier: str | None = '%mor', gra_tier: str | None = '%gra') → CHAT¶

Recursively load CHAT data from a directory.

Parameters:

path – Directory path to search.
match – Regex pattern to include only matching file paths.
extension – File extension to filter by (default: “.cha”).
parallel – If True, use parallel processing. Set to False to disable multithreading.
strict – If True (default), raise ValueError on mor/word misalignment. If False, emit a warning and set tokens to an empty list for affected utterances.
mor_tier – Name of the dependent tier to treat as the morphology tier, e.g. "%mor" or "%xmor". Set to None to disable mor+gra handling.
gra_tier – Name of the dependent tier to treat as the grammatical relation tier, e.g. "%gra" or "%xgra". Set to None to disable mor+gra handling.

Returns:

A new CHAT reader with the parsed data.

Raises:

ValueError – If strict is True and mor/word misalignment is found.
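
The effect of strict can be pictured with a small stand-alone sketch. The function align_words_with_mor is hypothetical and compares plain list lengths; real %mor alignment involves more cases (clitics, for example) than this shows.

```python
import warnings


def align_words_with_mor(words, mor_items, *, strict=True):
    """Hypothetical sketch: pair each word with its %mor item.

    On a length mismatch, raise if strict; otherwise warn and return no tokens.
    """
    if len(words) != len(mor_items):
        message = (
            f"mor/word misalignment: {len(words)} words "
            f"vs. {len(mor_items)} mor items"
        )
        if strict:
            raise ValueError(message)
        warnings.warn(message)
        return []
    return list(zip(words, mor_items))


print(align_words_with_mor(["more", "cookie"], ["qn|more", "n|cookie"]))
```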

classmethod from_files(paths: Sequence[str | os.PathLike[str]], *, parallel: bool = True, strict: bool = True, mor_tier: str | None = '%mor', gra_tier: str | None = '%gra') → CHAT¶

Load CHAT data from file paths.

Parameters:

paths – Paths to CHAT files.
parallel – If True, use parallel processing. Set to False to disable multithreading.
strict – If True (default), raise ValueError on mor/word misalignment. If False, emit a warning and set tokens to an empty list for affected utterances.
mor_tier – Name of the dependent tier to treat as the morphology tier, e.g. "%mor" or "%xmor". Set to None to disable mor+gra handling.
gra_tier – Name of the dependent tier to treat as the grammatical relation tier, e.g. "%gra" or "%xgra". Set to None to disable mor+gra handling.

Returns:

A new CHAT reader with the parsed data.

Raises:

ValueError – If strict is True and mor/word misalignment is found.

classmethod from_git(url: str, *, rev: str | None = None, depth: int | None = None, match: str | None = None, extension: str = '.cha', cache_dir: str | os.PathLike[str] | None = None, force_download: bool = False, parallel: bool = True, strict: bool = True, mor_tier: str | None = '%mor', gra_tier: str | None = '%gra') → CHAT¶

Load CHAT data from a git repository.

Clones the repository (or uses a cached clone) and parses all matching files from the resulting directory.

Parameters:

url – Git repository URL.
rev – Branch, tag, or commit hash. If None, uses the repository’s default branch.
depth – Clone depth. If None, a shallow clone (depth 1) is made. Ignored when rev is a commit hash.
match – Regex pattern to include only matching file paths.
extension – File extension to filter by (default: “.cha”).
cache_dir – Directory for caching cloned repositories. Defaults to ~/.rustling/cache/.
force_download – If True, re-clone even if a cached copy exists.
parallel – If True, use parallel processing.
strict – If True (default), raise ValueError on mor/word misalignment. If False, emit a warning and set tokens to an empty list for affected utterances.
mor_tier – Name of the dependent tier to treat as the morphology tier. Set to None to disable.
gra_tier – Name of the dependent tier to treat as the grammatical relation tier. Set to None to disable.

Returns:

A new CHAT reader with the parsed data.

classmethod from_strs(strs: Sequence[str], ids: Sequence[str] | None = None, parallel: bool = True, strict: bool = True, mor_tier: str | None = '%mor', gra_tier: str | None = '%gra') → CHAT¶

Parse CHAT data from in-memory strings.

Parameters:

strs – CHAT-formatted strings to parse.
ids – Optional identifiers for each string. If None, UUIDs are generated.
parallel – If True, use parallel processing. Set to False to disable multithreading.
strict – If True (default), raise ValueError on mor/word misalignment. If False, emit a warning and set tokens to an empty list for affected utterances.
mor_tier – Name of the dependent tier to treat as the morphology tier, e.g. "%mor" or "%xmor". Set to None to disable mor+gra handling.
gra_tier – Name of the dependent tier to treat as the grammatical relation tier, e.g. "%gra" or "%xgra". Set to None to disable mor+gra handling.

Returns:

A new CHAT reader with the parsed data.

Raises:

ValueError – If strs and ids have different lengths, or if strict is True and mor/word misalignment is found.
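
How strs and ids relate can be sketched as follows. The helper resolve_ids is hypothetical, and the exact UUID flavor the library generates is not stated here; uuid4 is an assumption.

```python
import uuid
from typing import List, Optional, Sequence


def resolve_ids(strs: Sequence[str], ids: Optional[Sequence[str]] = None) -> List[str]:
    """Hypothetical sketch: one identifier per input string."""
    if ids is None:
        # Assumption: random (version 4) UUIDs stand in for missing ids.
        return [str(uuid.uuid4()) for _ in strs]
    if len(ids) != len(strs):
        raise ValueError(f"got {len(strs)} strings but {len(ids)} ids")
    return list(ids)


print(resolve_ids(["@Begin\n@End\n"], ids=["sample-1"]))
```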

classmethod from_url(url: str, *, match: str | None = None, extension: str = '.cha', cache_dir: str | os.PathLike[str] | None = None, force_download: bool = False, parallel: bool = True, strict: bool = True, mor_tier: str | None = '%mor', gra_tier: str | None = '%gra') → CHAT¶

Load CHAT data from a URL.

Downloads the file (or uses a cached copy) and parses it. ZIP files are automatically detected and extracted.

Parameters:

url – URL to download from.
match – Regex pattern to include only matching file paths (applicable for ZIP files).
extension – File extension to filter by (default: “.cha”, applicable for ZIP files).
cache_dir – Directory for caching downloads. Defaults to ~/.rustling/cache/.
force_download – If True, re-download even if a cached copy exists.
parallel – If True, use parallel processing.
strict – If True (default), raise ValueError on mor/word misalignment. If False, emit a warning and set tokens to an empty list for affected utterances.
mor_tier – Name of the dependent tier to treat as the morphology tier. Set to None to disable.
gra_tier – Name of the dependent tier to treat as the grammatical relation tier. Set to None to disable.

Returns:

A new CHAT reader with the parsed data.
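
The ZIP handling described above can be approximated with the standard zipfile module. This is a sketch under assumptions: read_chat_payload is a hypothetical helper, detection by zipfile.is_zipfile and UTF-8 decoding are choices made for the example, and no download or caching is shown.

```python
import io
import re
import zipfile


def read_chat_payload(payload: bytes, *, match=None, extension=".cha"):
    """Hypothetical sketch: return {name: text} for downloaded bytes.

    A non-ZIP payload is treated as a single CHAT file stored under "".
    """
    buffer = io.BytesIO(payload)
    if not zipfile.is_zipfile(buffer):
        return {"": payload.decode("utf-8")}
    texts = {}
    with zipfile.ZipFile(buffer) as archive:
        for name in sorted(archive.namelist()):
            if not name.endswith(extension):
                continue  # extension filter applies to archive members
            if match is not None and re.search(match, name) is None:
                continue  # regex filter on the member path
            texts[name] = archive.read(name).decode("utf-8")
    return texts


print(read_chat_payload(b"@Begin\n@End\n"))
```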

classmethod from_utterances(utterances: Sequence[Utterance]) → CHAT¶

Construct a CHAT reader from a list of utterances.

Creates a new reader containing a single virtual file with the given utterances. Useful for splitting a reader into sub-readers based on utterance boundaries. Raw lines are synthesized from each utterance’s tiers data, so to_strs() and to_files() produce valid CHAT output.

Parameters:: utterances – Utterance objects to include.
Returns:: A new CHAT reader containing the given utterances.

classmethod from_zip(path: str | os.PathLike[str], *, match: str | None = None, extension: str = '.cha', parallel: bool = True, strict: bool = True, mor_tier: str | None = '%mor', gra_tier: str | None = '%gra') → CHAT¶

Load CHAT data from a ZIP archive.

Parameters:

path – Path to the ZIP file.
match – Regex pattern to include only matching file paths.
extension – File extension to filter by (default: “.cha”).
parallel – If True, use parallel processing. Set to False to disable multithreading.
strict – If True (default), raise ValueError on mor/word misalignment. If False, emit a warning and set tokens to an empty list for affected utterances.
mor_tier – Name of the dependent tier to treat as the morphology tier, e.g. "%mor" or "%xmor". Set to None to disable mor+gra handling.
gra_tier – Name of the dependent tier to treat as the grammatical relation tier, e.g. "%gra" or "%xgra". Set to None to disable mor+gra handling.

Returns:

A new CHAT reader with the parsed data.

Raises:

ValueError – If strict is True and mor/word misalignment is found.

head(n: int = 5) → Utterances¶

Return the first n utterances with a formatted display.

Parameters:: n – Number of utterances to include.
Returns:: An Utterances object that displays as formatted text.

headers() → list[Headers]¶

Return file-level headers.

Returns:: A list of Headers, one per file.

info(*, verbose: bool = False) → None¶

Print a summary of this reader’s data.

Parameters:: verbose – If True, show the details of all files. Defaults to False (shows first 5 files only).

ipsyn(*, participant: str = 'CHI', n: int | None = 100) → list[int]¶

Index of Productive Syntax (IPSyn).

Parameters:

participant – Target participant code.
n – Number of utterances to use per file. None for all.

Returns:

One score (0-112) per file.

languages(*, by_file: bool = False) → list[str] | list[list[str]]¶

Return languages.

Parameters:: by_file – If True, group languages by file.
Returns:: Language codes, optionally grouped by file.

mlu(*, participant: str = 'CHI', n: int | None = 100) → list[float]¶

Mean length of utterance in morphemes.

Alias for mlum().

Parameters:

participant – Target participant code.
n – Number of utterances to use per file. None for all.

Returns:

One value per file.

mlum(*, participant: str = 'CHI', n: int | None = 100) → list[float]¶

Mean length of utterance in morphemes.

Parameters:

participant – Target participant code.
n – Number of utterances to use per file. None for all.

Returns:

One value per file.

mluw(*, participant: str = 'CHI', n: int | None = 100) → list[float]¶

Mean length of utterance in words.

Parameters:

participant – Target participant code.
n – Number of utterances to use per file. None for all.

Returns:

One value per file.
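
For intuition, mean length of utterance in words reduces to an average over utterance lengths. The sketch below assumes that n selects the first n utterances, that punctuation has already been removed, and that an empty selection yields 0.0; the function name is hypothetical and none of these details are confirmed by this reference.

```python
def mean_length_in_words(utterances, n=100):
    """Hypothetical sketch: average number of words per utterance."""
    selected = utterances if n is None else utterances[:n]
    if not selected:
        return 0.0  # assumption: no utterances means a value of 0.0
    return sum(len(words) for words in selected) / len(selected)


sample = [["more", "cookie"], ["no"], ["I", "want", "juice"]]
print(mean_length_in_words(sample))
```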

n_files¶

Return the number of files.

Returns:: Number of loaded files.

participants(*, by_file: bool = False) → list[Participant] | list[list[Participant]]¶

Return participants.

Parameters:: by_file – If True, group participants by file.
Returns:: Participants, optionally grouped by file.

pop() → CHAT¶

Remove and return the last file as a new CHAT reader.

Returns:: A new CHAT reader containing the removed file.
Raises:: IndexError – If the reader is empty.

pop_left() → CHAT¶

Remove and return the first file as a new CHAT reader.

Returns:: A new CHAT reader containing the removed file.
Raises:: IndexError – If the reader is empty.

tail(n: int = 5) → Utterances¶

Return the last n utterances with a formatted display.

Parameters:: n – Number of utterances to include.
Returns:: An Utterances object that displays as formatted text.

to_conllu() → CoNLLU¶

Convert to a CoNLL-U object.

Returns:: A CoNLLU object.

to_conllu_files(*, filenames: Sequence[str] | None = None) → None¶

Write CoNLL-U (.conllu) files to a directory.

Parameters:

dir_path – Directory path to write .conllu files to.
filenames – Custom filenames for the output files.

Raises:

ValueError – If the number of filenames does not match the number of files.
IOError – If writing fails.

to_conllu_strs() → list[str]¶: Return CoNLL-U format strings, one per file.

to_elan() → ELAN¶

Convert to an ELAN object.

Each CHAT file produces one ELAN file. Participants become alignable tiers, and dependent tiers (e.g., %mor, %gra) become reference annotation tiers named {tier}@{participant} (e.g., mor@CHI).

Returns:: An ELAN object.

to_elan_files(*, filenames: Sequence[str] | None = None) → None¶

Write ELAN (.eaf) files to a directory.

Converts the CHAT data to ELAN XML format and writes .eaf files. Each CHAT file produces one .eaf file. Participants become alignable tiers, and dependent tiers (e.g., %mor, %gra) become reference annotation tiers named {tier}@{participant} (e.g., mor@CHI).

Parameters:

dir_path – Directory path to write .eaf files to.
filenames – Custom filenames for the output files. If None, filenames are derived from the original source file paths with the extension changed to .eaf (e.g., foo.cha becomes foo.eaf). Falls back to 0001.eaf, 0002.eaf, etc. when the data was parsed from in-memory strings.

Raises:

ValueError – If the number of filenames does not match the number of files.
IOError – If writing fails.
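
The default naming rule can be sketched with pathlib. The helper derive_filenames is hypothetical, and representing data parsed from in-memory strings as a None source path, with the fallback number taken from the file's position, are assumptions made for the example.

```python
from pathlib import PurePath


def derive_filenames(sources, extension):
    """Hypothetical sketch: output names from source paths, else 0001, 0002, ..."""
    names = []
    for position, source in enumerate(sources, start=1):
        if source is None:
            # Assumption: no source path means the data came from a string.
            names.append(f"{position:04d}{extension}")
        else:
            names.append(PurePath(source).with_suffix(extension).name)
    return names


print(derive_filenames(["corpus/foo.cha", None], ".eaf"))
```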

to_elan_strs() → list[str]¶

Return EAF XML strings, one per file.

Converts the CHAT data to ELAN XML format. Each CHAT file produces one EAF XML string. Participants become alignable tiers, and dependent tiers (e.g., %mor, %gra) become reference annotation tiers named {tier}@{participant} (e.g., mor@CHI).

Returns:: A list of EAF XML strings.

to_files(*, filenames: Sequence[str] | None = None) → None¶

Write CHAT (.cha) files to a directory.

Parameters:

dir_path – Directory path to write .cha files to.
filenames – Custom filenames for the output files. If None, filenames are derived from the original source file paths (e.g., foo.cha stays foo.cha). Falls back to 0001.cha, 0002.cha, etc. when the data was parsed from in-memory strings.

Raises:

ValueError – If the number of filenames does not match the number of files.
IOError – If writing fails.

to_srt(*, participants: Sequence[str] | None = None) → SRT¶

Convert to an SRT object.

Each CHAT file produces one SRT file. When multiple participants are present, subtitle text is prefixed with the participant code (e.g., "CHI: more cookie ."). Utterances without time marks are skipped.

Parameters:: participants – Participant codes to include. If None, all participants are included.
Returns:: An SRT object.
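
The conversion rules above can be illustrated with a stand-alone sketch. It assumes utterances are (participant, text, time_marks) tuples with time marks in milliseconds, and that "multiple participants" is judged after the participants filter is applied; both function names are hypothetical.

```python
def ms_to_srt_timestamp(ms: int) -> str:
    """Format milliseconds as an SRT timestamp, HH:MM:SS,mmm."""
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def utterances_to_srt(utterances, participants=None):
    """Hypothetical sketch: build one SRT string from utterance tuples."""
    kept = [
        (code, text, marks)
        for code, text, marks in utterances
        if marks is not None  # utterances without time marks are skipped
        and (participants is None or code in participants)
    ]
    multiple = len({code for code, _, _ in kept}) > 1
    blocks = []
    for number, (code, text, (start, end)) in enumerate(kept, start=1):
        line = f"{code}: {text}" if multiple else text
        times = f"{ms_to_srt_timestamp(start)} --> {ms_to_srt_timestamp(end)}"
        blocks.append(f"{number}\n{times}\n{line}\n")
    return "\n".join(blocks)


sample = [
    ("CHI", "more cookie .", (0, 1500)),
    ("MOT", "okay .", None),
    ("MOT", "here you go .", (1500, 4000)),
]
print(utterances_to_srt(sample))
```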

to_srt_files(*, participants: Sequence[str] | None = None, filenames: Sequence[str] | None = None) → None¶

Write SRT (.srt) files to a directory.

Parameters:

dir_path – Directory path to write .srt files to.
participants – Participant codes to include. If None, all participants are included.
filenames – Custom filenames for the output files. If None, filenames are derived from the original source file paths with the extension changed to .srt.

Raises:

ValueError – If the number of filenames does not match the number of files.
IOError – If writing fails.

to_srt_strs(*, participants: Sequence[str] | None = None) → list[str]¶

Return SRT format strings, one per file.

Parameters:: participants – Participant codes to include. If None, all participants are included. Utterances without time marks are skipped.
Returns:: A list of SRT-formatted strings.

to_strs() → list[str]¶

Return CHAT data strings, one per file.

Returns:: A list of CHAT-formatted strings.

to_textgrid(*, participants: Sequence[str] | None = None) → TextGrid¶: Convert to a TextGrid object.

to_textgrid_files(*, participants: Sequence[str] | None = None, filenames: Sequence[str] | None = None) → None¶: Write TextGrid (.TextGrid) files to a directory.

to_textgrid_strs(*, participants: Sequence[str] | None = None) → list[str]¶: Return TextGrid format strings, one per file.

tokens(*, by_utterance: bool = False, by_file: bool = False) → list[Token] | list[list[Token]] | list[list[list[Token]]]¶

Return tokens.

Parameters:

by_utterance – If True, group tokens by utterance.
by_file – If True, group tokens by file.

Returns:

Tokens with optional grouping.

ttr(*, participant: str = 'CHI', n: int | None = 350) → list[float]¶

Type-token ratio for non-punctuation words.

Parameters:

participant – Target participant code.
n – Number of tokens to use per file. None for all.

Returns:

One value per file.
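
As a minimal sketch of the measure, a type-token ratio divides the number of distinct words by the number of word tokens. The punctuation set, the choice to apply n after dropping punctuation, and the 0.0 result for empty input are all assumptions made for this example; the function name is hypothetical.

```python
PUNCTUATION = {".", "?", "!", ","}  # assumed set, for illustration only


def type_token_ratio(words, n=350):
    """Hypothetical sketch: distinct words / total words over the first n tokens."""
    tokens = [w for w in words if w not in PUNCTUATION]
    if n is not None:
        tokens = tokens[:n]
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


print(type_token_ratio(["more", "cookie", "more", "."]))
```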

utterances(*, by_file: bool = False) → list[Utterance] | list[list[Utterance]]¶

Return utterances.

Parameters:: by_file – If True, group utterances by file.
Returns:: Utterances, optionally grouped by file.

word_ngrams(n: int) → Ngrams¶

Return an Ngrams for word n-grams across all utterances.

N-grams do not cross utterance boundaries.

Parameters:: n – The n-gram order (1 for unigrams, 2 for bigrams, etc.).
Returns:: An Ngrams with the accumulated counts.
Raises:: ValueError – If n < 1.
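
The boundary rule can be seen in a short sketch built on collections.Counter. The function count_word_ngrams is hypothetical and stands in for the general technique, not for the Ngrams implementation.

```python
from collections import Counter


def count_word_ngrams(utterances, n):
    """Hypothetical sketch: count n-grams within each utterance separately."""
    if n < 1:
        raise ValueError("n must be at least 1")
    counts = Counter()
    for words in utterances:
        # Windows are taken inside one utterance, so none spans two utterances.
        for start in range(len(words) - n + 1):
            counts[tuple(words[start:start + n])] += 1
    return counts


bigrams = count_word_ngrams([["more", "cookie"], ["more", "juice"]], 2)
print(bigrams)
```

Because each utterance is windowed on its own, the pair ("cookie", "more") is never counted even though the two words are adjacent in the flattened word list.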

words(*, by_utterance: bool = False, by_file: bool = False) → list[str] | list[list[str]] | list[list[list[str]]]¶

Return words.

Parameters:

by_utterance – If True, group words by utterance.
by_file – If True, group words by file.

Returns:

Words with optional grouping.
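
The three return shapes in the signature correspond to how much nesting is kept. The sketch below assumes the underlying data is organized as files, each holding utterances, each holding words; the helper group_words is hypothetical.

```python
def group_words(files, *, by_utterance=False, by_file=False):
    """Hypothetical sketch: flatten files -> utterances -> words as requested."""
    if by_file and by_utterance:
        return [[list(utt) for utt in file] for file in files]
    if by_file:
        return [[word for utt in file for word in utt] for file in files]
    if by_utterance:
        return [list(utt) for file in files for utt in file]
    return [word for file in files for utt in file for word in utt]


data = [[["more", "cookie"], ["no"]], [["hi"]]]
print(group_words(data))
print(group_words(data, by_utterance=True))
print(group_words(data, by_file=True))
```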

class pylangacq.ChangeableHeader¶

A changeable header that can appear mid-file in CHAT transcripts.

Variants: Activities, Bck, Bg, Blank, Comment, Date, Eg, G, NewEpisode, Page, Situation.

class pylangacq.Gra(dep, head, rel)¶

A grammatical relation from the %gra tier.

dep: int¶: Position of the dependent word.

head: int¶: Position of the head word.

rel: str¶: Grammatical relation type.

class pylangacq.Headers¶

File-level headers from a CHAT file.

pid: str | None¶: Persistent identifier from @PID.

languages: list[str]¶: Language codes from @Languages.

participants: list[Participant]¶: Participants from @Participants and @ID.

options: str | None¶: Options from @Options.

media: dict[str, str | None] | None¶: Media descriptor from @Media as a dict with keys “filename”, “format”, and “status”.

date: datetime.date | None¶: Date from @Date, parsed as a date object. The CHAT format DD-MMM-YYYY (e.g., 25-JAN-1983) is tried first, then ISO YYYY-MM-DD. If neither format matches, the value is None.
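
The two-step parsing order can be sketched as follows. The function parse_chat_date and the explicit month table are assumptions for illustration; the table avoids locale-dependent month names, which may differ from how the library parses.

```python
import datetime
from typing import Optional

_MONTHS = {
    name: number
    for number, name in enumerate(
        ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
         "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"],
        start=1,
    )
}


def parse_chat_date(value: str) -> Optional[datetime.date]:
    """Hypothetical sketch: try DD-MMM-YYYY first, then ISO YYYY-MM-DD."""
    text = value.strip()
    parts = text.split("-")
    if len(parts) == 3 and parts[1].upper() in _MONTHS:
        try:
            return datetime.date(int(parts[2]), _MONTHS[parts[1].upper()], int(parts[0]))
        except ValueError:
            pass  # fall through to the ISO attempt
    try:
        return datetime.date.fromisoformat(text)
    except ValueError:
        return None  # neither format matched


print(parse_chat_date("25-JAN-1983"))
```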

location: str | None¶: Location from @Location.

number: str | None¶: Number of participants from @Number.

recording_quality: str | None¶: Recording quality from @Recording Quality.

room_layout: str | None¶: Room layout from @Room Layout.

tape_location: str | None¶: Tape location from @Tape Location.

time_duration: str | None¶: Time duration from @Time Duration.

time_start: str | None¶: Time start from @Time Start.

transcriber: str | None¶: Transcriber from @Transcriber.

transcription: str | None¶: Transcription type from @Transcription.

types: str | None¶: Types from @Types.

videos: str | None¶: Videos from @Videos.

warning: str | None¶: Warning from @Warning.

situation: str | None¶: Situation from @Situation.

comments: list[str] | None¶: All @Comment values from the header section, in order.

other: dict[str, str]¶: Unrecognized headers as key-value pairs.

class pylangacq.Ngrams(n: int, *, min_n: int | None = None) → None¶

A counter for storing n-grams efficiently and counting their frequencies.

Accumulates n-gram counts from sequences of elements. N-grams do not cross sequence boundaries.

clear() → None¶: Clear all counts.

count(seq: Sequence[str]) → None¶

Count n-grams from a single sequence.

Parameters:: seq – A sequence of elements to extract n-grams from.

count_seqs(seqs: Sequence[Sequence[str]]) → None¶

Count n-grams from multiple sequences.

Parameters:: seqs – A sequence of sequences. Each inner sequence is counted separately, so n-grams do not cross sequence boundaries.

get(ngram: Sequence[str]) → int¶

Return the count for a specific n-gram.

Parameters:: ngram – The n-gram to look up.
Returns:: The count, or 0 if not observed.

items(*, order: int | None = None) → list[tuple[tuple[str, ...], int]]¶

Return all (n-gram, count) pairs.

Parameters:: order – If specified, only return n-grams of this specific order. Must be between min_n and n (inclusive).
Returns:: A list of (ngram_tuple, count) pairs.
Raises:: ValueError – If order is out of range.

min_n¶: The minimum n-gram order.

most_common(n: int | None = None, *, order: int | None = None) → list[tuple[tuple[str, ...], int]]¶

Return the n most common n-grams with their counts.

Parameters:

n – Number of top entries to return. If None, returns all n-grams sorted by count (descending).
order – If specified, only return n-grams of this specific order. Must be between the counter's min_n and its n-gram order n (inclusive); this is the counter's n attribute, not the n argument above.

Returns:

A list of (ngram_tuple, count) pairs sorted by count.

Raises:

ValueError – If order is out of range.

n¶: The n-gram order.

to_counter(*, order: int | None = None) → Counter[tuple[str, ...]]¶

Convert to a collections.Counter.

Parameters:: order – If specified, only include n-grams of this specific order. Must be between min_n and n (inclusive). If None, defaults to the highest order (n).
Returns:: A Counter mapping n-gram tuples to their counts.
Raises:: ValueError – If order is out of range.

total(*, order: int | None = None) → int¶

Return the total number of n-gram tokens counted.

Parameters:: order – If specified, return total for this specific order only. Must be between min_n and n (inclusive). If None, returns the sum across all orders.
Returns:: Total count.
Raises:: ValueError – If order is out of range.

class pylangacq.Participant¶

A participant from @Participants and @ID headers.

code: str¶: Three-letter speaker ID (e.g., “CHI”).

name: str¶: Speaker name (may be empty).

role: str¶: Standard role (e.g., “Target_Child”).

language: str | None¶: Language from @ID.

corpus: str | None¶: Corpus name from @ID.

age: Age | None¶: Age from @ID.

sex: str | None¶: Sex from @ID.

group: str | None¶: Group from @ID.

ses: str | None¶: Ethnicity/SES from @ID.

education: str | None¶: Education level from @ID.

custom: str | None¶: Custom field from @ID.

birth: str | None¶: Birth date from the participant's “@Birth of” header.

birthplace: str | None¶: Birthplace from the participant's “@Birthplace of” header.

l1: str | None¶: First language from the participant's “@L1 of” header.

class pylangacq.Token(word, pos=None, mor=None, gra=None)¶: A single token: a word form with its optional part-of-speech tag (pos), morphological information (mor), and grammatical relation (gra).

class pylangacq.Utterance(*, index: int, line: str, time_marks: tuple[int, int]) → None¶

A single utterance in a CHAT transcript.

audible¶: Audibly faithful transcript of this utterance, or None for headers.

to_str()¶: Return a plain text tabular representation of this utterance.

class pylangacq.Utterances¶

A sequence of utterances with formatted display.

Returned by CHAT.head() and CHAT.tail(). Displays as column-aligned plain text in the terminal and as HTML tables in Jupyter notebooks.