API Reference¶

Read CHAT data.

Parameters:

path – Path to a .zip file, a local directory containing .cha files, or a single .cha file.
filter_files – Filename(s) to keep. Regular expression matching is supported. If None, all files are included.
filter_participants – Participant code(s) to keep. Regular expression matching is supported. If None, all participants are included.
cls – The class used to create the reader. Must be CHAT or a subclass of it.
strict – If True, enforce strict parsing of the CHAT data.

Returns:

A CHAT instance filtered by the specified files and participants.

Raises:

TypeError – If cls is not CHAT or a subclass of it.
ValueError – If path does not point to a .zip file, a directory, or a .cha file.

class pylangacq.Age¶

Age in the CHAT format: years;months.days.

years: int¶: Number of years.

months: int | None¶: Number of months.

days: int | None¶: Number of days.

in_months() → float¶: Return the age in total months as a float.
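
As a rough illustration of the years;months.days notation, the sketch below parses an age string and converts it to months using only the standard library. The parse_age and age_in_months helpers are hypothetical, and the 30-day month is an assumption made for this example; neither is confirmed to match what Age.in_months() does internally.

```python
from typing import Optional, Tuple


def parse_age(text: str) -> Tuple[int, Optional[int], Optional[int]]:
    """Split a CHAT age string such as "2;06.15" into (years, months, days).

    Hypothetical helper for illustration; missing parts come back as None.
    """
    years_part, _, rest = text.partition(";")
    months_part, _, days_part = rest.partition(".")
    months = int(months_part) if months_part else None
    days = int(days_part) if days_part else None
    return int(years_part), months, days


def age_in_months(years: int, months: Optional[int], days: Optional[int]) -> float:
    """Convert to total months, ASSUMING a 30-day month (not confirmed)."""
    return years * 12 + (months or 0) + (days or 0) / 30


print(parse_age("2;06.15"))                  # (2, 6, 15)
print(age_in_months(*parse_age("2;06.15")))  # 30.5
```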

class pylangacq.CHAT¶

CHAT data reader for CHILDES/TalkBank transcripts.

This class parses CHAT transcription files and provides access to utterances, tokens, words, and annotations.

ages() → list[Age | None]¶

Return the age of the target child (CHI) in each file.

Returns:: One entry per file: the child’s Age, or None if the file has no CHI participant or the CHI has no recorded age.

append() → None¶

Append data from another CHAT reader.

Parameters:: other – A CHAT reader whose data to append.

append_left() → None¶

Left-append data from another CHAT reader.

Parameters:: other – A CHAT reader whose data to prepend.

clear() → None¶: Remove all data from this reader.

extend() → None¶

Extend data from multiple CHAT readers.

Parameters:: others – CHAT readers whose data to append.

extend_left() → None¶

Left-extend data from multiple CHAT readers.

Parameters:: others – CHAT readers whose data to prepend.

file_paths¶

Return the list of file paths.

Returns:: File paths or identifiers.

filter(*, files: str | Sequence[str] | None = None, participants: str | Sequence[str] | None = None) → CHAT¶

Return a new CHAT filtered by file path and/or participant regex.

Parameters:

files – Regex pattern(s) to include only matching file paths. Accepts a single string or a sequence of strings. Multiple patterns are OR’d.
participants – Regex pattern(s) to include only matching participant codes. Accepts a single string or a sequence of strings. Patterns are auto-anchored (full match). Multiple patterns are OR’d.

Returns:

A new filtered CHAT reader.
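
The matching rules above can be pictured with the standard re module. This sketch does not call pylangacq. It assumes that file patterns may match anywhere in the path, while participant patterns must match the whole code; only the participant rule is stated above, so the file-path behavior is an assumption. The helper names are hypothetical.

```python
import re
from typing import Sequence, Union

Patterns = Union[str, Sequence[str]]


def _as_list(patterns: Patterns) -> list:
    # A single string is treated as a one-element list of patterns.
    return [patterns] if isinstance(patterns, str) else list(patterns)


def keep_participant(code: str, patterns: Patterns) -> bool:
    """Full-match semantics: the whole code must match at least one pattern."""
    return any(re.fullmatch(p, code) for p in _as_list(patterns))


def keep_file(path: str, patterns: Patterns) -> bool:
    """ASSUMED search semantics: a pattern may match anywhere in the path."""
    return any(re.search(p, path) for p in _as_list(patterns))


codes = ["CHI", "MOT", "FAT", "CHI2"]
print([c for c in codes if keep_participant(c, "CHI")])           # ['CHI']
print([c for c in codes if keep_participant(c, ["MOT", "FAT"])])  # ['MOT', 'FAT']
```

Under full-match anchoring, "CHI" does not select "CHI2"; a pattern such as "CHI.*" would.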

classmethod from_dir(path: str | os.PathLike[str], *, match: str | None = None, extension: str = '.cha', parallel: bool = True, strict: bool = True) → CHAT¶

Recursively load CHAT data from a directory.

Parameters:

path – Directory path to search.
match – Regex pattern to include only matching file paths.
extension – File extension to filter by (default: “.cha”).
parallel – If True, use parallel processing. Set to False to disable multithreading.
strict – If True (default), raise ValueError on mor/word misalignment. If False, emit a warning and set tokens to an empty list for affected utterances.

Returns:

A new CHAT reader with the parsed data.

Raises:

ValueError – If strict is True and mor/word misalignment is found.
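
The two behaviors of the strict flag can be pictured with the stdlib-only sketch below. It assumes, purely for illustration, that misalignment means the main tier and the %mor tier yield different numbers of items; the real alignment rules are likely more involved. align_tokens is a hypothetical helper, not part of pylangacq.

```python
import warnings
from typing import List, Tuple


def align_tokens(words: List[str], mors: List[str], *,
                 strict: bool = True) -> List[Tuple[str, str]]:
    """Pair each word with its %mor item, mimicking strict/non-strict handling."""
    if len(words) != len(mors):
        message = f"mor/word misalignment: {len(words)} words vs {len(mors)} mor items"
        if strict:
            raise ValueError(message)
        # Non-strict: warn and give the affected utterance an empty token list.
        warnings.warn(message)
        return []
    return list(zip(words, mors))


print(align_tokens(["more", "cookie"], ["qn|more", "n|cookie"]))

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    print(align_tokens(["more", "cookie"], ["qn|more"], strict=False))  # []
    print(len(caught))  # 1
```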

classmethod from_files(paths: Sequence[str | os.PathLike[str]], *, parallel: bool = True, strict: bool = True) → CHAT¶

Load CHAT data from file paths.

Parameters:

paths – Paths to CHAT files.
parallel – If True, use parallel processing. Set to False to disable multithreading.
strict – If True (default), raise ValueError on mor/word misalignment. If False, emit a warning and set tokens to an empty list for affected utterances.

Returns:

A new CHAT reader with the parsed data.

Raises:

ValueError – If strict is True and mor/word misalignment is found.

classmethod from_strs(strs: Sequence[str], ids: Sequence[str] | None = None, parallel: bool = True, strict: bool = True) → CHAT¶

Parse CHAT data from in-memory strings.

Parameters:

strs – CHAT-formatted strings to parse.
ids – Optional identifiers for each string. If None, UUIDs are generated.
parallel – If True, use parallel processing. Set to False to disable multithreading.
strict – If True (default), raise ValueError on mor/word misalignment. If False, emit a warning and set tokens to an empty list for affected utterances.

Returns:

A new CHAT reader with the parsed data.

Raises:

ValueError – If strs and ids have different lengths, or if strict is True and mor/word misalignment is found.

classmethod from_utterances(utterances: Sequence[Utterance]) → CHAT¶

Construct a CHAT reader from a list of utterances.

Creates a new reader containing a single virtual file with the given utterances. Useful for splitting a reader into sub-readers based on utterance boundaries. Raw lines are synthesized from each utterance’s tiers, so to_strs() and to_chat() produce valid CHAT output.

Parameters:: utterances – Utterance objects to include.
Returns:: A new CHAT reader containing the given utterances.

classmethod from_zip(path: str | os.PathLike[str], *, match: str | None = None, extension: str = '.cha', parallel: bool = True, strict: bool = True) → CHAT¶

Load CHAT data from a ZIP archive.

Parameters:

path – Path to the ZIP file.
match – Regex pattern to include only matching file paths.
extension – File extension to filter by (default: “.cha”).
parallel – If True, use parallel processing. Set to False to disable multithreading.
strict – If True (default), raise ValueError on mor/word misalignment. If False, emit a warning and set tokens to an empty list for affected utterances.

Returns:

A new CHAT reader with the parsed data.

Raises:

ValueError – If strict is True and mor/word misalignment is found.

head(n: int = 5) → Utterances¶

Return the first n utterances with a formatted display.

Parameters:: n – Number of utterances to include.
Returns:: An Utterances object that displays as formatted text.

headers() → list[Headers]¶

Return file-level headers.

Returns:: A list of Headers, one per file.

info(*, verbose: bool = False) → None¶

Print a summary of this reader’s data.

Parameters:: verbose – If True, show the details of all files. If False (default), show only the first 5 files.

ipsyn(*, participant: str = 'CHI', n: int | None = 100) → list[int]¶

Index of Productive Syntax (IPSyn).

Parameters:

participant – Target participant code.
n – Number of utterances to use per file. None for all.

Returns:

One score (0-112) per file.

languages(*, by_file: bool = False) → list[str] | list[list[str]]¶

Return languages.

Parameters:: by_file – If True, group languages by file.
Returns:: Language codes, optionally grouped by file.

mlu(*, participant: str = 'CHI', n: int | None = 100) → list[float]¶

Mean length of utterance in morphemes.

Alias for mlum().

Parameters:

participant – Target participant code.
n – Number of utterances to use per file. None for all.

Returns:

One value per file.

mlum(*, participant: str = 'CHI', n: int | None = 100) → list[float]¶

Mean length of utterance in morphemes.

Parameters:

participant – Target participant code.
n – Number of utterances to use per file. None for all.

Returns:

One value per file.

mluw(*, participant: str = 'CHI', n: int | None = 100) → list[float]¶

Mean length of utterance in words.

Parameters:

participant – Target participant code.
n – Number of utterances to use per file. None for all.

Returns:

One value per file.
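
To make the per-file return value and the n cutoff concrete, here is a minimal sketch of a mean-length-of-utterance calculation over plain word lists. It assumes that n selects the first n utterances of the participant and that an empty selection yields 0.0; neither detail is confirmed here. mean_length is a hypothetical helper and does not use pylangacq.

```python
from typing import List, Optional


def mean_length(utterances: List[List[str]], n: Optional[int] = 100) -> float:
    """Mean number of items per utterance over the first n utterances."""
    selected = utterances if n is None else utterances[:n]
    if not selected:
        return 0.0  # assumption: empty input gives 0.0
    return sum(len(u) for u in selected) / len(selected)


# Two "files", each a list of utterances already restricted to one participant.
files = [
    [["more", "cookie"], ["no"], ["want", "that", "one"]],
    [["doggie", "go"], ["bye", "bye"]],
]
print([mean_length(f) for f in files])       # [2.0, 2.0]
print([mean_length(f, n=2) for f in files])  # [1.5, 2.0]
```

Passing lists of morphemes instead of words would give the morpheme-based variant under the same assumptions.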

n_files¶

Return the number of files.

Returns:: Number of loaded files.

participants(*, by_file: bool = False) → list[Participant] | list[list[Participant]]¶

Return participants.

Parameters:: by_file – If True, group participants by file.
Returns:: Participants, optionally grouped by file.

pop() → CHAT¶

Remove and return the last file as a new CHAT reader.

Returns:: A new CHAT reader containing the removed file.
Raises:: IndexError – If the reader is empty.

pop_left() → CHAT¶

Remove and return the first file as a new CHAT reader.

Returns:: A new CHAT reader containing the removed file.
Raises:: IndexError – If the reader is empty.
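
The append, extend, and pop methods are named like the operations of a double-ended queue of files. As an analogy only, the sketch below uses collections.deque with file names standing in for parsed files. It does not use pylangacq; the IndexError on an empty container is the only behavior taken from the text above.

```python
from collections import deque

files = deque(["a.cha", "b.cha", "c.cha"])

files.append("d.cha")             # like append(): add at the right end
files.appendleft("z.cha")         # like append_left(): add at the left end
files.extend(["e.cha", "f.cha"])  # like extend(): add several at the right end

last = files.pop()                # like pop(): removes the last file
first = files.popleft()           # like pop_left(): removes the first file
print(first, last, list(files))

files.clear()                     # like clear(): nothing left
try:
    files.pop()
except IndexError as exc:
    print("empty:", exc)
```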

tail(n: int = 5) → Utterances¶

Return the last n utterances with a formatted display.

Parameters:: n – Number of utterances to include.
Returns:: An Utterances object that displays as formatted text.

to_chat(path: str | os.PathLike[str], *, is_dir: bool = False, filenames: Sequence[str] | None = None) → None¶

Write CHAT data to disk.

Parameters:

path – Output file path (or directory if is_dir is True).
is_dir – If True, write multiple files to the directory.
filenames – Custom filenames when writing to a directory. If None, uses 0001.cha, 0002.cha, etc.

Raises:

ValueError – If the reader has multiple files but is_dir is False, or if filenames count doesn’t match file count.
IOError – If writing fails.
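
The default naming scheme and the two ValueError conditions can be sketched as follows. resolve_filenames is a hypothetical helper written for this example; it only mirrors the rules listed above and performs no I/O.

```python
from typing import List, Optional, Sequence


def resolve_filenames(n_files: int, *, is_dir: bool = False,
                      filenames: Optional[Sequence[str]] = None) -> List[str]:
    """Return the file names to create inside the output directory."""
    if not is_dir:
        if n_files > 1:
            raise ValueError("multiple files require is_dir=True")
        return []  # single-file mode: `path` itself is the output file
    if filenames is None:
        # Default scheme from the text: 0001.cha, 0002.cha, ...
        return [f"{i:04d}.cha" for i in range(1, n_files + 1)]
    if len(filenames) != n_files:
        raise ValueError("filenames count does not match file count")
    return list(filenames)


print(resolve_filenames(3, is_dir=True))  # ['0001.cha', '0002.cha', '0003.cha']
```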

to_strs() → list[str]¶

Return CHAT data strings, one per file.

Returns:: A list of CHAT-formatted strings.

tokens(*, by_utterance: bool = False, by_file: bool = False) → list[Token] | list[list[Token]] | list[list[list[Token]]]¶

Return tokens.

Parameters:

by_utterance – If True, group tokens by utterance.
by_file – If True, group tokens by file.

Returns:

Tokens with optional grouping.

ttr(*, participant: str = 'CHI', n: int | None = 350) → list[float]¶

Type-token ratio for non-punctuation words.

Parameters:

participant – Target participant code.
n – Number of tokens to use per file. None for all.

Returns:

One value per file.
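
A type-token ratio is the number of distinct words divided by the total number of words. The sketch below computes it over a plain list of words that is assumed to have punctuation already removed. It also assumes that n selects the first n tokens and that an empty selection yields 0.0. type_token_ratio is a hypothetical helper, not the library's implementation.

```python
from typing import List, Optional


def type_token_ratio(words: List[str], n: Optional[int] = 350) -> float:
    """Distinct words divided by total words over the first n tokens."""
    selected = words if n is None else words[:n]
    if not selected:
        return 0.0  # assumption: no tokens gives 0.0
    return len(set(selected)) / len(selected)


words = ["the", "dog", "saw", "the", "cat"]
print(type_token_ratio(words))       # 0.8
print(type_token_ratio(words, n=3))  # 1.0
```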

utterances(*, by_file: bool = False) → list[Utterance] | list[list[Utterance]]¶

Return utterances.

Parameters:: by_file – If True, group utterances by file.
Returns:: Utterances, optionally grouped by file.

word_ngrams(n: int) → Ngrams¶

Return an Ngrams for word n-grams across all utterances.

N-grams do not cross utterance boundaries.

Parameters:: n – The n-gram order (1 for unigrams, 2 for bigrams, etc.).
Returns:: An Ngrams with the accumulated counts.
Raises:: ValueError – If n < 1.
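
The rule that n-grams do not cross utterance boundaries can be illustrated with collections.Counter. This is a stdlib-only sketch with a hypothetical count_ngrams helper; it shows the counting rule and is not the library's implementation.

```python
from collections import Counter
from typing import Iterable, List


def count_ngrams(utterances: Iterable[List[str]], n: int) -> Counter:
    """Count word n-grams within each utterance, never across two utterances."""
    if n < 1:
        raise ValueError("n must be at least 1")
    counts: Counter = Counter()
    for words in utterances:
        # Zipping n shifted views of the same utterance yields its n-grams.
        counts.update(zip(*(words[i:] for i in range(n))))
    return counts


utterances = [["more", "cookie", "please"], ["more", "cookie"]]
bigrams = count_ngrams(utterances, 2)
print(bigrams.most_common(1))       # [(('more', 'cookie'), 2)]
print(bigrams[("please", "more")])  # 0
```

The pair ("please", "more") is never counted because "please" ends one utterance and "more" starts the next.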

words(*, by_utterance: bool = False, by_file: bool = False) → list[str] | list[list[str]] | list[list[list[str]]]¶

Return words.

Parameters:

by_utterance – If True, group words by utterance.
by_file – If True, group words by file.

Returns:

Words with optional grouping.
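
The three return shapes in the signature correspond to different levels of nesting. The sketch below uses hypothetical hand-written data to show the shapes one would expect for each flag combination; the exact mapping of flags to shapes is inferred from the parameter descriptions, not verified against the library.

```python
from typing import List

# Hypothetical nested data: files -> utterances -> words.
nested: List[List[List[str]]] = [
    [["more", "cookie"], ["no"]],  # file 1
    [["doggie", "go"]],            # file 2
]

# by_utterance=True, by_file=True: keep both levels.
by_file_and_utterance = nested

# by_file=True only: one flat word list per file.
by_file = [[w for utt in utts for w in utt] for utts in nested]

# by_utterance=True only: one word list per utterance, files merged.
by_utterance = [utt for utts in nested for utt in utts]

# Neither flag: a single flat list of words.
flat = [w for utts in nested for utt in utts for w in utt]

print(flat)          # ['more', 'cookie', 'no', 'doggie', 'go']
print(by_file)       # [['more', 'cookie', 'no'], ['doggie', 'go']]
print(by_utterance)  # [['more', 'cookie'], ['no'], ['doggie', 'go']]
```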

class pylangacq.ChangeableHeader¶

A changeable header that can appear mid-file in CHAT transcripts.

Variants: Activities, Bck, Bg, Blank, Comment, Date, Eg, G, NewEpisode, Page, Situation.

class pylangacq.Gra(dep, head, rel)¶

A grammatical relation from the %gra tier.

dep: int¶: Position of the dependent word.

head: int¶: Position of the head word.

rel: str¶: Grammatical relation type.
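
For orientation, the sketch below parses a %gra tier string into records with the same three fields. It assumes the usual CHAT layout in which each item is written as dep|head|rel and a head of 0 marks the root. GraSketch and parse_gra_tier are hypothetical stand-ins, not the library's class or parser.

```python
from typing import List, NamedTuple


class GraSketch(NamedTuple):
    """Stand-in with the same three fields as Gra (not the library class)."""
    dep: int
    head: int
    rel: str


def parse_gra_tier(tier: str) -> List[GraSketch]:
    """Parse a %gra tier string such as "1|2|SUBJ 2|0|ROOT 3|2|PUNCT"."""
    relations = []
    for item in tier.split():
        dep, head, rel = item.split("|")
        relations.append(GraSketch(int(dep), int(head), rel))
    return relations


for gra in parse_gra_tier("1|2|SUBJ 2|0|ROOT 3|2|PUNCT"):
    print(gra)
```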

class pylangacq.Headers¶

File-level headers from a CHAT file.

pid: str | None¶: Persistent identifier from @PID.

languages: list[str]¶: Language codes from @Languages.

participants: list[Participant]¶: Participants from @Participants and @ID.

options: str | None¶: Options from @Options.

media: dict[str, str | None] | None¶: Media descriptor from @Media as a dict with keys “filename”, “format”, and “status”.

date: str | None¶: Date from @Date.

location: str | None¶: Location from @Location.

number: str | None¶: Number of participants from @Number.

recording_quality: str | None¶: Recording quality from @Recording Quality.

room_layout: str | None¶: Room layout from @Room Layout.

tape_location: str | None¶: Tape location from @Tape Location.

time_duration: str | None¶: Time duration from @Time Duration.

time_start: str | None¶: Time start from @Time Start.

transcriber: str | None¶: Transcriber from @Transcriber.

transcription: str | None¶: Transcription type from @Transcription.

types: str | None¶: Types from @Types.

videos: str | None¶: Videos from @Videos.

warning: str | None¶: Warning from @Warning.

situation: str | None¶: Situation from @Situation.

comments: list[str] | None¶: All @Comment values from the header section, in order.

other: dict[str, str]¶: Unrecognized headers as key-value pairs.

class pylangacq.Ngrams(n: int, *, min_n: int | None = None) → None¶

A counter for storing n-grams efficiently and counting their frequencies.

Accumulates n-gram counts from sequences of elements. N-grams do not cross sequence boundaries.

clear() → None¶: Clear all counts.

count(seq: Sequence[str]) → None¶

Count n-grams from a single sequence.

Parameters:: seq – A sequence of elements to extract n-grams from.

count_seqs(seqs: Sequence[Sequence[str]]) → None¶

Count n-grams from multiple sequences.

Parameters:: seqs – A sequence of sequences; n-grams are counted within each one separately.

get(ngram: Sequence[str]) → int¶

Return the count for a specific n-gram.

Parameters:: ngram – The n-gram to look up.
Returns:: The count, or 0 if not observed.

items(*, order: int | None = None) → list[tuple[tuple[str, ...], int]]¶

Return all (n-gram, count) pairs.

Parameters:: order – If specified, only return n-grams of this specific order. Must be between min_n and n (inclusive).
Returns:: A list of (ngram_tuple, count) pairs.
Raises:: ValueError – If order is out of range.

min_n¶: The minimum n-gram order.

most_common(n: int | None = None, *, order: int | None = None) → list[tuple[tuple[str, ...], int]]¶

Return the n most common n-grams with their counts.

Parameters:

n – Number of top entries to return. If None, returns all n-grams sorted by count (descending).
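order – If specified, only return n-grams of this specific order. Must be between min_n and the counter’s n-gram order (inclusive); this bound is the counter’s n attribute, not the n argument of this method.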

Returns:

A list of (ngram_tuple, count) pairs sorted by count.

Raises:

ValueError – If order is out of range.

n¶: The n-gram order.

to_counter(*, order: int | None = None) → Counter[tuple[str, ...]]¶

Convert to a collections.Counter.

Parameters:: order – If specified, only include n-grams of this specific order. Must be between min_n and n (inclusive). If None, defaults to the highest order (n).
Returns:: A Counter mapping n-gram tuples to their counts.
Raises:: ValueError – If order is out of range.

total(*, order: int | None = None) → int¶

Return the total number of n-gram tokens counted.

Parameters:: order – If specified, return total for this specific order only. Must be between min_n and n (inclusive). If None, returns the sum across all orders.
Returns:: Total count.
Raises:: ValueError – If order is out of range.
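
The interplay of n, min_n, and order can be sketched with a toy class built on collections.Counter. NgramsSketch is a hypothetical stand-in, not pylangacq.Ngrams: it assumes that every order from min_n to n is counted and that min_n defaults to n. It follows the defaults described above for total() and to_counter().

```python
from collections import Counter
from typing import Dict, Optional, Sequence


class NgramsSketch:
    """Toy multi-order n-gram counter; a stand-in, not pylangacq.Ngrams."""

    def __init__(self, n: int, *, min_n: Optional[int] = None) -> None:
        self.n = n
        self.min_n = n if min_n is None else min_n  # assumption: default is n
        self._counts: Dict[int, Counter] = {
            k: Counter() for k in range(self.min_n, self.n + 1)
        }

    def count(self, seq: Sequence[str]) -> None:
        # Count every order separately; n-grams stay inside this sequence.
        for k, counter in self._counts.items():
            counter.update(tuple(seq[i:i + k]) for i in range(len(seq) - k + 1))

    def _check(self, order: int) -> None:
        if not self.min_n <= order <= self.n:
            raise ValueError(f"order must be between {self.min_n} and {self.n}")

    def total(self, *, order: Optional[int] = None) -> int:
        if order is None:  # sum across all orders
            return sum(sum(c.values()) for c in self._counts.values())
        self._check(order)
        return sum(self._counts[order].values())

    def to_counter(self, *, order: Optional[int] = None) -> Counter:
        order = self.n if order is None else order  # default: highest order
        self._check(order)
        return Counter(self._counts[order])


ngrams = NgramsSketch(2, min_n=1)
ngrams.count(["more", "cookie", "please"])
ngrams.count(["more", "cookie"])
print(ngrams.total())                           # 8 (5 unigrams + 3 bigrams)
print(ngrams.total(order=2))                    # 3
print(ngrams.to_counter()[("more", "cookie")])  # 2
```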

class pylangacq.Participant¶

A participant from @Participants and @ID headers.

code: str¶: Three-letter speaker ID (e.g., “CHI”).

name: str¶: Speaker name (may be empty).

role: str¶: Standard role (e.g., “Target_Child”).

language: str | None¶: Language from @ID.

corpus: str | None¶: Corpus name from @ID.

age: Age | None¶: Age from @ID.

sex: str | None¶: Sex from @ID.

group: str | None¶: Group from @ID.

ses: str | None¶: Ethnicity/SES from @ID.

education: str | None¶: Education level from @ID.

custom: str | None¶: Custom field from @ID.

birth: str | None¶: Birth date from @Birth of header.

birthplace: str | None¶: Birthplace from @Birthplace of header.

l1: str | None¶: First language from @L1 of header.
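
Most of these attributes come from the pipe-separated @ID header. The sketch below splits such a value into named fields, assuming the conventional CHAT field order (language, corpus, code, age, sex, group, SES, role, education, custom). parse_id_header and ID_FIELDS are hypothetical names for this example, and the sample value is made up; how the library itself populates Participant is not shown here.

```python
from typing import Dict, Optional

# Field order of an @ID header value in CHAT, as assumed for this sketch.
ID_FIELDS = ["language", "corpus", "code", "age", "sex",
             "group", "ses", "role", "education", "custom"]


def parse_id_header(value: str) -> Dict[str, Optional[str]]:
    """Split the value of an @ID header into named fields; empty -> None."""
    parts = value.split("|")
    return {name: (part or None) for name, part in zip(ID_FIELDS, parts)}


fields = parse_id_header("eng|Brown|CHI|2;04.17|male|||Target_Child|||")
print(fields["code"], fields["age"], fields["role"])  # CHI 2;04.17 Target_Child
print(fields["group"])                                # None
```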

class pylangacq.Token(word, pos=None, mor=None, gra=None)¶

A token with word, POS, morphology, and grammatical relation.

word: str¶: The word form.

pos: str | None¶: Part-of-speech tag from the %mor tier.

mor: str | None¶: Morphological information from the %mor tier.

gra: Gra | None¶: Grammatical relation from the %gra tier.

class pylangacq.Utterance(*, participant=None, tokens=None, time_marks=None, tiers=None, changeable_header=None)¶

A single utterance from a CHAT transcript.

For changeable headers (e.g., @Comment, @New Episode), only changeable_header is set; all other fields are None.

participant: str | None¶: Speaker code (e.g., “CHI”, “MOT”), or None for headers.

tokens: list[Token] | None¶: List of tokens in this utterance, or None for headers.

raw: str | None¶: Raw transcript of this utterance, or None for headers.

time_marks: tuple[int, int] | None¶: Start and end timestamps in milliseconds.

tiers: dict[str, str] | None¶: Raw tier data including the main tier and dependent tiers, or None for headers.

changeable_header: ChangeableHeader | None¶: The header variant if this is a changeable header, or None for real utterances.

to_str() → str¶

Return a plain text tabular representation of this utterance.

Returns:: A column-aligned string with participant words, %mor, %gra, other tiers, and time marks. For changeable headers, returns the CHAT-format string (e.g., @Comment:\tChild laughs).

class pylangacq.Utterances¶

A sequence of utterances with formatted display.

Returned by CHAT.head() and CHAT.tail(). Displays as column-aligned plain text in the terminal and as HTML tables in Jupyter notebooks.