API Reference

pylangacq.read_chat(path: str | os.PathLike[str], *, filter_files: str | Sequence[str] | None = None, filter_participants: str | Sequence[str] | None = None, cls: type[CHAT] = CHAT, strict: bool = True) CHAT

Read CHAT data.

Parameters:
  • path – Path to a .zip file, a local directory containing .cha files, a single .cha file, a git repository URL (ending in .git), or an HTTP/HTTPS URL.

  • filter_files – Filename(s) to keep. Regular expression matching is supported. If None, all files are included.

  • filter_participants – Participant code(s) to keep. Regular expression matching is supported. If None, all participants are included.

  • cls – The class used to create the reader. Must be CHAT or a subclass of it.

  • strict – If True, enforce strict parsing of the CHAT data.

Returns:

A CHAT instance filtered by the specified files and participants.

Raises:
  • TypeError – If cls is not CHAT or a subclass of it.

  • ValueError – If path does not point to a recognized source.

class pylangacq.Age

Age in the CHAT format: years;months.days.

years: int

Number of years.

months: int | None

Number of months.

days: int | None

Number of days.

in_months() float

Return the age in total months as a float.
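
The years;months.days notation and its conversion to months can be sketched in plain Python. This is a standalone illustration, not the library's implementation: the names parse_age, ParsedAge, and in_months are hypothetical, and counting one month as 30 days is an assumption.

```python
import re
from typing import NamedTuple, Optional

# Matches "years;months.days" where months and days may be missing,
# e.g. "2;06.15", "2;06.", or "2;".
_AGE_RE = re.compile(r"^(\d+);(?:(\d+)(?:\.(\d+)?)?)?$")


class ParsedAge(NamedTuple):
    years: int
    months: Optional[int]
    days: Optional[int]


def parse_age(text: str) -> ParsedAge:
    """Parse a CHAT age string into (years, months, days)."""
    m = _AGE_RE.match(text.strip())
    if m is None:
        raise ValueError(f"not a CHAT age: {text!r}")
    years, months, days = m.groups()
    return ParsedAge(
        int(years),
        int(months) if months is not None else None,
        int(days) if days is not None else None,
    )


def in_months(age: ParsedAge) -> float:
    """Total months; assumes (hypothetically) 30 days per month."""
    return age.years * 12 + (age.months or 0) + (age.days or 0) / 30
```

Under these assumptions, an age of 2;06.15 corresponds to 30.5 months.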

class pylangacq.CHAT

CHAT data reader for CHILDES/TalkBank transcripts.

This class parses CHAT transcription files and provides access to utterances, tokens, words, and annotations.

ages() list[Age | None]

Return the age of the target child (CHI) in each file.

Returns:

One Age per file, or None if the file has no CHI or the CHI has no age.

append(other) None

Append data from another CHAT reader.

Parameters:

other – The CHAT reader whose data is appended.

append_left(other) None

Left-append data from another CHAT reader.

Parameters:

other – The CHAT reader whose data is prepended.

clear() None

Remove all data from this reader.

extend(others) None

Extend data from multiple CHAT readers.

Parameters:

others – The CHAT readers whose data is appended.

extend_left(others) None

Left-extend data from multiple CHAT readers.

Parameters:

others – The CHAT readers whose data is prepended.

file_paths

Return the list of file paths.

Returns:

File paths or identifiers.

filter(*, files: str | Sequence[str] | None = None, participants: str | Sequence[str] | None = None) CHAT

Return a new CHAT filtered by file path and/or participant regex.

Parameters:
  • files – Regex pattern(s) to include only matching file paths. Accepts a single string or a sequence of strings. Multiple patterns are OR’d.

  • participants – Regex pattern(s) to include only matching participant codes. Accepts a single string or a sequence of strings. Patterns are auto-anchored (full match). Multiple patterns are OR’d.

Returns:

A new filtered CHAT reader.
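
The matching rules described above (multiple patterns OR'd, participant patterns anchored to a full match) can be sketched with the standard re module. The helpers keep_file and keep_participant are hypothetical and not part of pylangacq; using re.search for file paths is an assumption, since the text only states that participant patterns are auto-anchored.

```python
import re
from typing import List, Optional, Sequence, Union

Patterns = Optional[Union[str, Sequence[str]]]


def _as_list(patterns: Patterns) -> Optional[List[str]]:
    """Normalize a single pattern or a sequence of patterns to a list."""
    if patterns is None:
        return None
    return [patterns] if isinstance(patterns, str) else list(patterns)


def keep_file(path: str, files: Patterns) -> bool:
    """Keep a path if any pattern matches anywhere in it (assumed: search)."""
    pats = _as_list(files)
    return pats is None or any(re.search(p, path) for p in pats)


def keep_participant(code: str, participants: Patterns) -> bool:
    """Keep a code only if some pattern matches the whole code."""
    pats = _as_list(participants)
    return pats is None or any(re.fullmatch(p, code) for p in pats)
```

With full-match anchoring, the pattern "CH" does not select the participant "CHI", whereas "CHI" or "CH." does.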

classmethod from_dir(path: str | os.PathLike[str], *, match: str | None = None, extension: str = '.cha', parallel: bool = True, strict: bool = True, mor_tier: str | None = '%mor', gra_tier: str | None = '%gra') CHAT

Recursively load CHAT data from a directory.

Parameters:
  • path – Directory path to search.

  • match – Regex pattern to include only matching file paths.

  • extension – File extension to filter by (default: “.cha”).

  • parallel – If True, use parallel processing. Set to False to disable multithreading.

  • strict – If True (default), raise ValueError on mor/word misalignment. If False, emit a warning and set tokens to an empty list for affected utterances.

  • mor_tier – Name of the dependent tier to treat as the morphology tier, e.g. "%mor" or "%xmor". Set to None to disable mor+gra handling.

  • gra_tier – Name of the dependent tier to treat as the grammatical relation tier, e.g. "%gra" or "%xgra". Set to None to disable mor+gra handling.

Returns:

A new CHAT reader with the parsed data.

Raises:

ValueError – If strict is True and mor/word misalignment is found.
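
How path, match, and extension might interact during file discovery can be sketched with pathlib. This is an illustration only: find_chat_files is a hypothetical helper, and both the sorted order and the choice to apply the regex to the path relative to the search directory are assumptions, not documented behavior.

```python
import re
from pathlib import Path
from typing import List, Optional


def find_chat_files(
    path: str, match: Optional[str] = None, extension: str = ".cha"
) -> List[Path]:
    """Recursively collect files under `path` with the given extension,
    optionally keeping only those whose relative path matches `match`."""
    root = Path(path)
    found = []
    for p in sorted(root.rglob("*")):
        if not p.is_file() or p.suffix != extension:
            continue
        relative = p.relative_to(root).as_posix()
        if match is not None and re.search(match, relative) is None:
            continue
        found.append(p)
    return found
```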

classmethod from_files(paths: Sequence[str | os.PathLike[str]], *, parallel: bool = True, strict: bool = True, mor_tier: str | None = '%mor', gra_tier: str | None = '%gra') CHAT

Load CHAT data from file paths.

Parameters:
  • paths – Paths to CHAT files.

  • parallel – If True, use parallel processing. Set to False to disable multithreading.

  • strict – If True (default), raise ValueError on mor/word misalignment. If False, emit a warning and set tokens to an empty list for affected utterances.

  • mor_tier – Name of the dependent tier to treat as the morphology tier, e.g. "%mor" or "%xmor". Set to None to disable mor+gra handling.

  • gra_tier – Name of the dependent tier to treat as the grammatical relation tier, e.g. "%gra" or "%xgra". Set to None to disable mor+gra handling.

Returns:

A new CHAT reader with the parsed data.

Raises:

ValueError – If strict is True and mor/word misalignment is found.

classmethod from_git(url: str, *, rev: str | None = None, depth: int | None = None, match: str | None = None, extension: str = '.cha', cache_dir: str | os.PathLike[str] | None = None, force_download: bool = False, parallel: bool = True, strict: bool = True, mor_tier: str | None = '%mor', gra_tier: str | None = '%gra') CHAT

Load CHAT data from a git repository.

Clones the repository (or uses a cached clone) and parses all matching files from the resulting directory.

Parameters:
  • url – Git repository URL.

  • rev – Branch, tag, or commit hash. If None, uses the repository’s default branch.

  • depth – Clone depth. If None, a shallow clone of depth 1 is used. Ignored when rev is a commit hash.

  • match – Regex pattern to include only matching file paths.

  • extension – File extension to filter by (default: “.cha”).

  • cache_dir – Directory for caching cloned repositories. Defaults to ~/.rustling/cache/.

  • force_download – If True, re-clone even if a cached copy exists.

  • parallel – If True, use parallel processing.

  • strict – If True (default), raise ValueError on mor/word misalignment. If False, emit a warning and set tokens to an empty list for affected utterances.

  • mor_tier – Name of the dependent tier to treat as the morphology tier. Set to None to disable.

  • gra_tier – Name of the dependent tier to treat as the grammatical relation tier. Set to None to disable.

Returns:

A new CHAT reader with the parsed data.

classmethod from_strs(strs: Sequence[str], ids: Sequence[str] | None = None, parallel: bool = True, strict: bool = True, mor_tier: str | None = '%mor', gra_tier: str | None = '%gra') CHAT

Parse CHAT data from in-memory strings.

Parameters:
  • strs – CHAT-formatted strings to parse.

  • ids – Optional identifiers for each string. If None, UUIDs are generated.

  • parallel – If True, use parallel processing. Set to False to disable multithreading.

  • strict – If True (default), raise ValueError on mor/word misalignment. If False, emit a warning and set tokens to an empty list for affected utterances.

  • mor_tier – Name of the dependent tier to treat as the morphology tier, e.g. "%mor" or "%xmor". Set to None to disable mor+gra handling.

  • gra_tier – Name of the dependent tier to treat as the grammatical relation tier, e.g. "%gra" or "%xgra". Set to None to disable mor+gra handling.

Returns:

A new CHAT reader with the parsed data.

Raises:

ValueError – If strs and ids have different lengths, or if strict is True and mor/word misalignment is found.

classmethod from_url(url: str, *, match: str | None = None, extension: str = '.cha', cache_dir: str | os.PathLike[str] | None = None, force_download: bool = False, parallel: bool = True, strict: bool = True, mor_tier: str | None = '%mor', gra_tier: str | None = '%gra') CHAT

Load CHAT data from a URL.

Downloads the file (or uses a cached copy) and parses it. ZIP files are automatically detected and extracted.

Parameters:
  • url – URL to download from.

  • match – Regex pattern to include only matching file paths (applicable for ZIP files).

  • extension – File extension to filter by (default: “.cha”, applicable for ZIP files).

  • cache_dir – Directory for caching downloads. Defaults to ~/.rustling/cache/.

  • force_download – If True, re-download even if a cached copy exists.

  • parallel – If True, use parallel processing.

  • strict – If True (default), raise ValueError on mor/word misalignment. If False, emit a warning and set tokens to an empty list for affected utterances.

  • mor_tier – Name of the dependent tier to treat as the morphology tier. Set to None to disable.

  • gra_tier – Name of the dependent tier to treat as the grammatical relation tier. Set to None to disable.

Returns:

A new CHAT reader with the parsed data.

classmethod from_utterances(utterances: Sequence[Utterance]) CHAT

Construct a CHAT reader from a list of utterances.

Creates a new reader containing a single virtual file with the given utterances. Useful for splitting a reader into sub-readers based on utterance boundaries. Raw lines are synthesized from each utterance’s tier data, so to_strs() and to_files() produce valid CHAT output.

Parameters:

utterances – Utterance objects to include.

Returns:

A new CHAT reader containing the given utterances.

classmethod from_zip(path: str | os.PathLike[str], *, match: str | None = None, extension: str = '.cha', parallel: bool = True, strict: bool = True, mor_tier: str | None = '%mor', gra_tier: str | None = '%gra') CHAT

Load CHAT data from a ZIP archive.

Parameters:
  • path – Path to the ZIP file.

  • match – Regex pattern to include only matching file paths.

  • extension – File extension to filter by (default: “.cha”).

  • parallel – If True, use parallel processing. Set to False to disable multithreading.

  • strict – If True (default), raise ValueError on mor/word misalignment. If False, emit a warning and set tokens to an empty list for affected utterances.

  • mor_tier – Name of the dependent tier to treat as the morphology tier, e.g. "%mor" or "%xmor". Set to None to disable mor+gra handling.

  • gra_tier – Name of the dependent tier to treat as the grammatical relation tier, e.g. "%gra" or "%xgra". Set to None to disable mor+gra handling.

Returns:

A new CHAT reader with the parsed data.

Raises:

ValueError – If strict is True and mor/word misalignment is found.

head(n: int = 5) Utterances

Return the first n utterances with a formatted display.

Parameters:

n – Number of utterances to include.

Returns:

An Utterances object that displays as formatted text.

headers() list[Headers]

Return file-level headers.

Returns:

A list of Headers, one per file.

info(*, verbose: bool = False) None

Print a summary of this reader’s data.

Parameters:

verbose – If True, show the details of all files. Defaults to False (shows first 5 files only).

ipsyn(*, participant: str = 'CHI', n: int | None = 100) list[int]

Index of Productive Syntax (IPSyn).

Parameters:
  • participant – Target participant code.

  • n – Number of utterances to use per file. None for all.

Returns:

One score (0-112) per file.

languages(*, by_file: bool = False) list[str] | list[list[str]]

Return languages.

Parameters:

by_file – If True, group languages by file.

Returns:

Language codes, optionally grouped by file.

mlu(*, participant: str = 'CHI', n: int | None = 100) list[float]

Mean length of utterance in morphemes.

Alias for mlum().

Parameters:
  • participant – Target participant code.

  • n – Number of utterances to use per file. None for all.

Returns:

One value per file.

mlum(*, participant: str = 'CHI', n: int | None = 100) list[float]

Mean length of utterance in morphemes.

Parameters:
  • participant – Target participant code.

  • n – Number of utterances to use per file. None for all.

Returns:

One value per file.

mluw(*, participant: str = 'CHI', n: int | None = 100) list[float]

Mean length of utterance in words.

Parameters:
  • participant – Target participant code.

  • n – Number of utterances to use per file. None for all.

Returns:

One value per file.
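
The mean length of utterance for one file reduces to an average over a sample of utterances. The sketch below is not the library's implementation: mean_length_of_utterance is a hypothetical helper, taking the first n utterances and returning 0.0 for an empty sample are assumptions, and each utterance is expected to arrive as a list of units (words for mluw, morphemes for mlum) with punctuation already removed.

```python
from typing import Optional, Sequence


def mean_length_of_utterance(
    utterances: Sequence[Sequence[str]], n: Optional[int] = 100
) -> float:
    """Average number of units per utterance over the first n utterances."""
    sample = list(utterances) if n is None else list(utterances)[:n]
    if not sample:
        return 0.0  # assumption: no utterances yields 0.0
    return sum(len(u) for u in sample) / len(sample)
```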

n_files

Return the number of files.

Returns:

Number of loaded files.

participants(*, by_file: bool = False) list[Participant] | list[list[Participant]]

Return participants.

Parameters:

by_file – If True, group participants by file.

Returns:

Participants, optionally grouped by file.

pop() CHAT

Remove and return the last file as a new CHAT reader.

Returns:

A new CHAT reader containing the removed file.

Raises:

IndexError – If the reader is empty.

pop_left() CHAT

Remove and return the first file as a new CHAT reader.

Returns:

A new CHAT reader containing the removed file.

Raises:

IndexError – If the reader is empty.

tail(n: int = 5) Utterances

Return the last n utterances with a formatted display.

Parameters:

n – Number of utterances to include.

Returns:

An Utterances object that displays as formatted text.

to_conllu() CoNLLU

Convert to a CoNLL-U object.

Returns:

A CoNLLU object.

to_conllu_files(dir_path, *, filenames: Sequence[str] | None = None) None

Write CoNLL-U (.conllu) files to a directory.

Parameters:
  • dir_path – Directory path to write .conllu files to.

  • filenames – Custom filenames for the output files.

Raises:
  • ValueError – If filenames count doesn’t match file count.

  • IOError – If writing fails.

to_conllu_strs() list[str]

Return CoNLL-U format strings, one per file.

to_elan() ELAN

Convert to an ELAN object.

Each CHAT file produces one ELAN file. Participants become alignable tiers, and dependent tiers (e.g., %mor, %gra) become reference annotation tiers named {tier}@{participant} (e.g., mor@CHI).

Returns:

An ELAN object.
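
The {tier}@{participant} naming scheme can be illustrated with a small helper. The function elan_tier_names is hypothetical, and naming each alignable tier after its participant code is an assumption; only the reference-tier pattern (e.g., mor@CHI) comes from the description above.

```python
from typing import List, Sequence


def elan_tier_names(
    participants: Sequence[str], dependent_tiers: Sequence[str]
) -> List[str]:
    """List tier names for one file: one alignable tier per participant,
    followed by that participant's reference tiers."""
    names: List[str] = []
    for code in participants:
        names.append(code)  # assumption: alignable tier is named by code
        for tier in dependent_tiers:
            # "%mor" + "CHI" -> "mor@CHI"
            names.append(f"{tier.lstrip('%')}@{code}")
    return names
```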

to_elan_files(dir_path, *, filenames: Sequence[str] | None = None) None

Write ELAN (.eaf) files to a directory.

Converts the CHAT data to ELAN XML format and writes .eaf files. Each CHAT file produces one .eaf file. Participants become alignable tiers, and dependent tiers (e.g., %mor, %gra) become reference annotation tiers named {tier}@{participant} (e.g., mor@CHI).

Parameters:
  • dir_path – Directory path to write .eaf files to.

  • filenames – Custom filenames for the output files. If None, filenames are derived from the original source file paths with the extension changed to .eaf (e.g., foo.cha becomes foo.eaf). Falls back to 0001.eaf, 0002.eaf, etc. when the data was parsed from in-memory strings.

Raises:
  • ValueError – If filenames count doesn’t match file count.

  • IOError – If writing fails.

to_elan_strs() list[str]

Return EAF XML strings, one per file.

Converts the CHAT data to ELAN XML format. Each CHAT file produces one EAF XML string. Participants become alignable tiers, and dependent tiers (e.g., %mor, %gra) become reference annotation tiers named {tier}@{participant} (e.g., mor@CHI).

Returns:

A list of EAF XML strings.

to_files(dir_path, *, filenames: Sequence[str] | None = None) None

Write CHAT (.cha) files to a directory.

Parameters:
  • dir_path – Directory path to write .cha files to.

  • filenames – Custom filenames for the output files. If None, filenames are derived from the original source file paths (e.g., foo.cha stays foo.cha). Falls back to 0001.cha, 0002.cha, etc. when the data was parsed from in-memory strings.

Raises:
  • ValueError – If filenames count doesn’t match file count.

  • IOError – If writing fails.

to_srt(*, participants: Sequence[str] | None = None) SRT

Convert to an SRT object.

Each CHAT file produces one SRT file. When multiple participants are present, subtitle text is prefixed with the participant code (e.g., "CHI: more cookie ."). Utterances without time marks are skipped.

Parameters:

participants – Participant codes to include. If None, all participants are included.

Returns:

An SRT object.

to_srt_files(dir_path, *, participants: Sequence[str] | None = None, filenames: Sequence[str] | None = None) None

Write SRT (.srt) files to a directory.

Parameters:
  • dir_path – Directory path to write .srt files to.

  • participants – Participant codes to include. If None, all participants are included.

  • filenames – Custom filenames for the output files. If None, filenames are derived from the original source file paths with the extension changed to .srt.

Raises:
  • ValueError – If filenames count doesn’t match file count.

  • IOError – If writing fails.

to_srt_strs(*, participants: Sequence[str] | None = None) list[str]

Return SRT format strings, one per file.

Parameters:

participants – Participant codes to include. If None, all participants are included. Utterances without time marks are skipped.

Returns:

A list of SRT-formatted strings.
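
The SRT conversion rules (speaker prefix when several participants are present, untimed utterances skipped) can be sketched as follows. This is a standalone approximation rather than the library's code: utterances_to_srt and the tuple layout are hypothetical, time marks are taken to be milliseconds, and deciding on the speaker prefix after the participant filter is applied is an assumption.

```python
from typing import List, Optional, Sequence, Tuple

# (participant code, text, (start_ms, end_ms) or None if untimed)
Utt = Tuple[str, str, Optional[Tuple[int, int]]]


def _timestamp(ms: int) -> str:
    """Format milliseconds as an SRT timestamp, HH:MM:SS,mmm."""
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def utterances_to_srt(
    utterances: Sequence[Utt], participants: Optional[Sequence[str]] = None
) -> str:
    kept = [u for u in utterances if participants is None or u[0] in participants]
    # Prefix with the speaker code only when more than one speaker remains.
    multi = len({code for code, _, _ in kept}) > 1
    blocks: List[str] = []
    for code, text, marks in kept:
        if marks is None:
            continue  # utterances without time marks are skipped
        start, end = marks
        body = f"{code}: {text}" if multi else text
        blocks.append(
            f"{len(blocks) + 1}\n{_timestamp(start)} --> {_timestamp(end)}\n{body}\n"
        )
    return "\n".join(blocks)
```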

to_strs() list[str]

Return CHAT data strings, one per file.

Returns:

A list of CHAT-formatted strings.

to_textgrid(*, participants: Sequence[str] | None = None) TextGrid

Convert to a TextGrid object.

to_textgrid_files(*, participants: Sequence[str] | None = None, filenames: Sequence[str] | None = None) None

Write TextGrid (.TextGrid) files to a directory.

to_textgrid_strs(*, participants: Sequence[str] | None = None) list[str]

Return TextGrid format strings, one per file.

tokens(*, by_utterance: bool = False, by_file: bool = False) list[Token] | list[list[Token]] | list[list[list[Token]]]

Return tokens.

Parameters:
  • by_utterance – If True, group tokens by utterance.

  • by_file – If True, group tokens by file.

Returns:

Tokens with optional grouping.

ttr(*, participant: str = 'CHI', n: int | None = 350) list[float]

Type-token ratio for non-punctuation words.

Parameters:
  • participant – Target participant code.

  • n – Number of tokens to use per file. None for all.

Returns:

One value per file.
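
The type-token ratio is the number of distinct words divided by the number of word tokens in the sample. A minimal sketch, assuming the input is one file's non-punctuation words for the target participant; type_token_ratio is a hypothetical helper, and taking the first n tokens and returning 0.0 for an empty sample are assumptions.

```python
from typing import Optional, Sequence


def type_token_ratio(words: Sequence[str], n: Optional[int] = 350) -> float:
    """Distinct words divided by total words over the first n tokens."""
    sample = list(words) if n is None else list(words)[:n]
    if not sample:
        return 0.0  # assumption: no tokens yields 0.0
    return len(set(sample)) / len(sample)
```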

utterances(*, by_file: bool = False) list[Utterance] | list[list[Utterance]]

Return utterances.

Parameters:

by_file – If True, group utterances by file.

Returns:

Utterances, optionally grouped by file.

word_ngrams(n: int) Ngrams

Return an Ngrams for word n-grams across all utterances.

N-grams do not cross utterance boundaries.

Parameters:

n – The n-gram order (1 for unigrams, 2 for bigrams, etc.).

Returns:

An Ngrams with the accumulated counts.

Raises:

ValueError – If n < 1.
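
Counting n-grams that stay within utterance boundaries can be sketched with collections.Counter. The function count_word_ngrams is a hypothetical stand-in that returns a plain Counter rather than an Ngrams object.

```python
from collections import Counter
from typing import Iterable, Sequence


def count_word_ngrams(utterances: Iterable[Sequence[str]], n: int) -> Counter:
    """Count word n-grams, sliding the window inside each utterance only."""
    if n < 1:
        raise ValueError("n must be at least 1")
    counts: Counter = Counter()
    for words in utterances:
        # An utterance shorter than n contributes nothing.
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts
```

For the utterances "more cookie" and "more juice", the bigram ("cookie", "more") is never counted because it would span two utterances.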

words(*, by_utterance: bool = False, by_file: bool = False) list[str] | list[list[str]] | list[list[list[str]]]

Return words.

Parameters:
  • by_utterance – If True, group words by utterance.

  • by_file – If True, group words by file.

Returns:

Words with optional grouping.
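
The three return shapes follow from how by_utterance and by_file combine. The sketch below reproduces the nesting on hypothetical in-memory data; group_words is not a pylangacq function and takes the per-file, per-utterance word lists as an explicit argument.

```python
from itertools import chain
from typing import List


def group_words(
    files: List[List[List[str]]],
    *,
    by_utterance: bool = False,
    by_file: bool = False,
):
    """`files` holds, per file, a list of utterances (each a list of words)."""
    if by_utterance and by_file:
        return files  # list[list[list[str]]]
    if by_file:
        # One flat word list per file.
        return [list(chain.from_iterable(utts)) for utts in files]
    if by_utterance:
        # One word list per utterance, file boundaries dropped.
        return list(chain.from_iterable(files))
    # A single flat list of words.
    return list(chain.from_iterable(chain.from_iterable(files)))
```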

class pylangacq.ChangeableHeader

A changeable header that can appear mid-file in CHAT transcripts.

Variants: Activities, Bck, Bg, Blank, Comment, Date, Eg, G, NewEpisode, Page, Situation.

class pylangacq.Gra(dep, head, rel)

A grammatical relation from the %gra tier.

dep: int

Position of the dependent word.

head: int

Position of the head word.

rel: str

Grammatical relation type.

class pylangacq.Headers

File-level headers from a CHAT file.

pid: str | None

Persistent identifier from @PID.

languages: list[str]

Language codes from @Languages.

participants: list[Participant]

Participants from @Participants and @ID.

options: str | None

Options from @Options.

media: dict[str, str | None] | None

Media descriptor from @Media as a dict with keys “filename”, “format”, and “status”.

date: datetime.date | None

Date from @Date, parsed as a date object. The CHAT format DD-MMM-YYYY (e.g., 25-JAN-1983) is tried first, then ISO YYYY-MM-DD. If neither format matches, the value is None.
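
The two-step fallback can be sketched with the standard datetime module. The helper parse_header_date is hypothetical; it uses an explicit month table so the result does not depend on the system locale, and accepting only upper-case month abbreviations is a simplification of this sketch.

```python
import datetime
from typing import Optional

_MONTHS = {
    name: number
    for number, name in enumerate(
        ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
         "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"],
        start=1,
    )
}


def parse_header_date(text: str) -> Optional[datetime.date]:
    """Try DD-MMM-YYYY first, then ISO YYYY-MM-DD; return None otherwise."""
    text = text.strip()
    parts = text.split("-")
    if len(parts) == 3 and parts[1] in _MONTHS:
        try:
            return datetime.date(int(parts[2]), _MONTHS[parts[1]], int(parts[0]))
        except ValueError:
            pass  # e.g. an impossible day; fall through to the ISO attempt
    try:
        return datetime.date.fromisoformat(text)
    except ValueError:
        return None
```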

location: str | None

Location from @Location.

number: str | None

Number of participants from @Number.

recording_quality: str | None

Recording quality from @Recording Quality.

room_layout: str | None

Room layout from @Room Layout.

tape_location: str | None

Tape location from @Tape Location.

time_duration: str | None

Time duration from @Time Duration.

time_start: str | None

Time start from @Time Start.

transcriber: str | None

Transcriber from @Transcriber.

transcription: str | None

Transcription type from @Transcription.

types: str | None

Types from @Types.

videos: str | None

Videos from @Videos.

warning: str | None

Warning from @Warning.

situation: str | None

Situation from @Situation.

comments: list[str] | None

All @Comment values from the header section, in order.

other: dict[str, str]

Unrecognized headers as key-value pairs.

class pylangacq.Ngrams(n: int, *, min_n: int | None = None) None

A counter for storing n-grams efficiently and counting their frequencies.

Accumulates n-gram counts from sequences of elements. N-grams do not cross sequence boundaries.

clear() None

Clear all counts.

count(seq: Sequence[str]) None

Count n-grams from a single sequence.

Parameters:

seq – A sequence of elements to extract n-grams from.

count_seqs(seqs: Sequence[Sequence[str]]) None

Count n-grams from multiple sequences.

Parameters:

seqs – An iterable of sequences.

get(ngram: Sequence[str]) int

Return the count for a specific n-gram.

Parameters:

ngram – The n-gram to look up.

Returns:

The count, or 0 if not observed.

items(*, order: int | None = None) list[tuple[tuple[str, ...], int]]

Return all (n-gram, count) pairs.

Parameters:

order – If specified, only return n-grams of this specific order. Must be between min_n and n (inclusive).

Returns:

A list of (ngram_tuple, count) pairs.

Raises:

ValueError – If order is out of range.

min_n

The minimum n-gram order.

most_common(n: int | None = None, *, order: int | None = None) list[tuple[tuple[str, ...], int]]

Return the n most common n-grams with their counts.

Parameters:
  • n – Number of top entries to return. If None, returns all n-grams sorted by count (descending).

  • order – If specified, only return n-grams of this specific order. Must be between min_n and n (inclusive).

Returns:

A list of (ngram_tuple, count) pairs sorted by count.

Raises:

ValueError – If order is out of range.

n

The n-gram order.

to_counter(*, order: int | None = None) Counter[tuple[str, ...]]

Convert to a collections.Counter.

Parameters:

order – If specified, only include n-grams of this specific order. Must be between min_n and n (inclusive). If None, defaults to the highest order (n).

Returns:

A Counter mapping n-gram tuples to their counts.

Raises:

ValueError – If order is out of range.

total(*, order: int | None = None) int

Return the total number of n-gram tokens counted.

Parameters:

order – If specified, return total for this specific order only. Must be between min_n and n (inclusive). If None, returns the sum across all orders.

Returns:

Total count.

Raises:

ValueError – If order is out of range.
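
How the order argument relates to min_n and n can be illustrated with a small pure-Python stand-in. MultiOrderNgrams is hypothetical and does not reflect the library's internals; in particular, treating min_n=None as "count only order n" is an assumption.

```python
from collections import Counter
from typing import Dict, Optional, Sequence


class MultiOrderNgrams:
    """Hypothetical stand-in that counts every order from min_n to n."""

    def __init__(self, n: int, *, min_n: Optional[int] = None) -> None:
        self.n = n
        self.min_n = n if min_n is None else min_n  # assumption
        self._counts: Dict[int, Counter] = {
            k: Counter() for k in range(self.min_n, self.n + 1)
        }

    def count(self, seq: Sequence[str]) -> None:
        # N-grams never extend past the end of `seq`.
        for k, counter in self._counts.items():
            for i in range(len(seq) - k + 1):
                counter[tuple(seq[i:i + k])] += 1

    def _check(self, order: int) -> None:
        if not self.min_n <= order <= self.n:
            raise ValueError(f"order must be in [{self.min_n}, {self.n}]")

    def total(self, *, order: Optional[int] = None) -> int:
        if order is None:  # sum across all orders
            return sum(sum(c.values()) for c in self._counts.values())
        self._check(order)
        return sum(self._counts[order].values())

    def to_counter(self, *, order: Optional[int] = None) -> Counter:
        order = self.n if order is None else order  # default: highest order
        self._check(order)
        return Counter(self._counts[order])
```

In this sketch, total() with no order sums over every stored order, while to_counter() with no order returns only the highest order, mirroring the defaults described above.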

class pylangacq.Participant

A participant from @Participants and @ID headers.

code: str

Three-letter speaker ID (e.g., “CHI”).

name: str

Speaker name (may be empty).

role: str

Standard role (e.g., “Target_Child”).

language: str | None

Language from @ID.

corpus: str | None

Corpus name from @ID.

age: Age | None

Age from @ID.

sex: str | None

Sex from @ID.

group: str | None

Group from @ID.

ses: str | None

Ethnicity/SES from @ID.

education: str | None

Education level from @ID.

custom: str | None

Custom field from @ID.

birth: str | None

Birth date from the “@Birth of” header for this participant.

birthplace: str | None

Birthplace from the “@Birthplace of” header for this participant.

l1: str | None

First language from the “@L1 of” header for this participant.

class pylangacq.Token(word, pos=None, mor=None, gra=None)

A single token in a CHAT utterance: the word together with its optional part-of-speech tag (pos), morphology (mor), and grammatical relation (gra).

class pylangacq.Utterance(*, index: int, line: str, time_marks: tuple[int, int]) None

A single utterance in a CHAT transcript.

audible

Audibly faithful transcript of this utterance, or None for headers.

to_str()

Return a plain text tabular representation of this utterance.

class pylangacq.Utterances

A sequence of utterances with formatted display.

Returned by CHAT.head() and CHAT.tail(). Displays as column-aligned plain text in the terminal and as HTML tables in Jupyter notebooks.