API Reference¶
- pylangacq.read_chat(path: str | os.PathLike[str], *, filter_files: str | typing.Sequence[str] | None = None, filter_participants: str | typing.Sequence[str] | None = None, cls: type[CHAT] = <class 'builtins.CHAT'>, strict: bool = True) CHAT¶
Read CHAT data.
- Parameters:
path – Path to a .zip file, a local directory containing .cha files, or a single .cha file.
filter_files – Filename(s) to keep. Regular expression matching is supported. If None, all files are included.
filter_participants – Participant code(s) to keep. Regular expression matching is supported. If None, all participants are included.
cls – The class used to create the reader. Must be CHAT or a subclass of it.
strict – If True, enforce strict parsing of the CHAT data.
- Returns:
A CHAT instance filtered by the specified files and participants.
- Raises:
TypeError – If cls is not CHAT or a subclass of it.
ValueError – If path does not point to a .zip file, a directory, or a .cha file.
- class pylangacq.Age¶
Age in the CHAT format: years;months.days.
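To make the years;months.days notation concrete, the following minimal sketch splits such a string into three integers. The helper parse_age is hypothetical and is not part of pylangacq; it only illustrates the notation, and it assumes that a missing months or days field counts as zero.

```python
import re

# Hypothetical helper, not part of pylangacq: it illustrates the
# years;months.days notation only.
_AGE_RE = re.compile(r"^(\d+);(\d+)?(?:\.(\d+)?)?$")


def parse_age(text):
    """Split a CHAT age string into a (years, months, days) tuple."""
    match = _AGE_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a CHAT age string: {text!r}")
    years, months, days = match.groups()
    # Assumption: an omitted months or days field counts as zero.
    return int(years), int(months or 0), int(days or 0)


print(parse_age("2;06.15"))  # (2, 6, 15)
```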
- class pylangacq.CHAT¶
CHAT data reader for CHILDES/TalkBank transcripts.
This class parses CHAT transcription files and provides access to utterances, tokens, words, and annotations.
- ages() list[Age | None]¶
Return the age of the target child (CHI) in each file.
- Returns:
One Age per file, or None if the file has no CHI or the CHI has no age.
- append() None¶
Append data from another CHAT reader.
- Parameters:
other – A CHAT reader whose data to append.
- append_left() None¶
Left-append data from another CHAT reader.
- Parameters:
other – A CHAT reader whose data to prepend.
- extend() None¶
Extend data from multiple CHAT readers.
- Parameters:
others – CHAT readers whose data to append.
- extend_left() None¶
Left-extend data from multiple CHAT readers.
- Parameters:
others – CHAT readers whose data to prepend.
- file_paths¶
Return the list of file paths.
- Returns:
File paths or identifiers.
- filter(*, files: str | Sequence[str] | None = None, participants: str | Sequence[str] | None = None) CHAT¶
Return a new CHAT filtered by file path and/or participant regex.
- Parameters:
files – Regex pattern(s) to include only matching file paths. Accepts a single string or a sequence of strings. Multiple patterns are OR’d.
participants – Regex pattern(s) to include only matching participant codes. Accepts a single string or a sequence of strings. Patterns are auto-anchored (full match). Multiple patterns are OR’d.
- Returns:
A new filtered CHAT reader.
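The matching rules described for filter() can be sketched with the standard re module. This is an illustration, not pylangacq's implementation: the helper matches_any is hypothetical, and treating file patterns as unanchored (matching anywhere in the path) is an assumption drawn from the fact that only participant patterns are described as auto-anchored.

```python
import re

# Illustration only: this helper is hypothetical and does not
# reproduce pylangacq's internals.


def matches_any(patterns, text, *, anchored):
    """Return True if any pattern matches text (patterns are OR'd)."""
    if isinstance(patterns, str):
        patterns = [patterns]  # a single string is treated as one pattern
    for pattern in patterns:
        # Anchored: the whole string must match. Unanchored: match anywhere.
        if anchored:
            hit = re.fullmatch(pattern, text)
        else:
            hit = re.search(pattern, text)
        if hit:
            return True
    return False


codes = ["CHI", "MOT", "FAT", "CHI2"]
# Full-match semantics: "CHI" does not select "CHI2".
print([c for c in codes if matches_any(["CHI", "MOT"], c, anchored=True)])

paths = ["eve/010600a.cha", "adam/020304.cha"]
# Assumed search semantics for file paths.
print([p for p in paths if matches_any("eve", p, anchored=False)])
```

Under full-match semantics, selecting both CHI and CHI2 would need a pattern such as CHI\d? rather than CHI alone.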
- classmethod from_dir(path: str | os.PathLike[str], *, match: str | None = None, extension: str = '.cha', parallel: bool = True, strict: bool = True) CHAT¶
Recursively load CHAT data from a directory.
- Parameters:
path – Directory path to search.
match – Regex pattern to include only matching file paths.
extension – File extension to filter by (default: “.cha”).
parallel – If True, use parallel processing. Set to False to disable multithreading.
strict – If True (default), raise ValueError on mor/word misalignment. If False, emit a warning and set tokens to an empty list for affected utterances.
- Returns:
A new CHAT reader with the parsed data.
- Raises:
ValueError – If strict is True and mor/word misalignment is found.
- classmethod from_files(paths: Sequence[str | os.PathLike[str]], *, parallel: bool = True, strict: bool = True) CHAT¶
Load CHAT data from file paths.
- Parameters:
paths – Paths to CHAT files.
parallel – If True, use parallel processing. Set to False to disable multithreading.
strict – If True (default), raise ValueError on mor/word misalignment. If False, emit a warning and set tokens to an empty list for affected utterances.
- Returns:
A new CHAT reader with the parsed data.
- Raises:
ValueError – If strict is True and mor/word misalignment is found.
- classmethod from_strs(strs: Sequence[str], ids: Sequence[str] | None = None, parallel: bool = True, strict: bool = True) CHAT¶
Parse CHAT data from in-memory strings.
- Parameters:
strs – CHAT-formatted strings to parse.
ids – Optional identifiers for each string. If None, UUIDs are generated.
parallel – If True, use parallel processing. Set to False to disable multithreading.
strict – If True (default), raise ValueError on mor/word misalignment. If False, emit a warning and set tokens to an empty list for affected utterances.
- Returns:
A new CHAT reader with the parsed data.
- Raises:
ValueError – If strs and ids have different lengths, or if strict is True and mor/word misalignment is found.
- classmethod from_utterances(utterances: Sequence[Utterance]) CHAT¶
Construct a CHAT reader from a list of utterances.
Creates a new reader containing a single virtual file with the given utterances. Useful for splitting a reader into sub-readers based on utterance boundaries. Raw lines are synthesized from each utterance’s tiers data, so to_strs() and to_chat() produce valid CHAT output.
- Parameters:
utterances – Utterance objects to include.
- Returns:
A new CHAT reader containing the given utterances.
- classmethod from_zip(path: str | os.PathLike[str], *, match: str | None = None, extension: str = '.cha', parallel: bool = True, strict: bool = True) CHAT¶
Load CHAT data from a ZIP archive.
- Parameters:
path – Path to the ZIP file.
match – Regex pattern to include only matching file paths.
extension – File extension to filter by (default: “.cha”).
parallel – If True, use parallel processing. Set to False to disable multithreading.
strict – If True (default), raise ValueError on mor/word misalignment. If False, emit a warning and set tokens to an empty list for affected utterances.
- Returns:
A new CHAT reader with the parsed data.
- Raises:
ValueError – If strict is True and mor/word misalignment is found.
- head(n: int = 5) Utterances¶
Return the first n utterances with a formatted display.
- Parameters:
n – Number of utterances to include.
- Returns:
An Utterances object that displays as formatted text.
- info(*, verbose: bool = False) None¶
Print a summary of this reader’s data.
- Parameters:
verbose – If True, show the details of all files. Defaults to False (shows first 5 files only).
- ipsyn(*, participant: str = 'CHI', n: int | None = 100) list[int]¶
Index of Productive Syntax (IPSyn).
- Parameters:
participant – Target participant code.
n – Number of utterances to use per file. None for all.
- Returns:
One score (0-112) per file.
- languages(*, by_file: bool = False) list[str] | list[list[str]]¶
Return languages.
- Parameters:
by_file – If True, group languages by file.
- Returns:
Language codes, optionally grouped by file.
- mlu(*, participant: str = 'CHI', n: int | None = 100) list[float]¶
Mean length of utterance in morphemes.
Alias for mlum().
- Parameters:
participant – Target participant code.
n – Number of utterances to use per file. None for all.
- Returns:
One value per file.
- mlum(*, participant: str = 'CHI', n: int | None = 100) list[float]¶
Mean length of utterance in morphemes.
- Parameters:
participant – Target participant code.
n – Number of utterances to use per file. None for all.
- Returns:
One value per file.
- mluw(*, participant: str = 'CHI', n: int | None = 100) list[float]¶
Mean length of utterance in words.
- Parameters:
participant – Target participant code.
n – Number of utterances to use per file. None for all.
- Returns:
One value per file.
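As a rough illustration of what a mean length of utterance in words measures, the sketch below averages word counts over the first n utterances of one file. The function mean_length_in_words is hypothetical, works on plain lists of strings rather than pylangacq objects, and makes no claim about which words the library counts or how it handles an empty sample.

```python
def mean_length_in_words(utterances, n=100):
    """Average word count over the first n utterances (all if n is None)."""
    sample = utterances if n is None else utterances[:n]
    if not sample:
        return 0.0  # assumption: an empty sample yields 0.0
    return sum(len(words) for words in sample) / len(sample)


# Each inner list stands in for the words of one utterance.
utterances = [["more", "cookie"], ["no"], ["I", "want", "that"]]
print(mean_length_in_words(utterances))  # 2.0
```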
- n_files¶
Return the number of files.
- Returns:
Number of loaded files.
- participants(*, by_file: bool = False) list[Participant] | list[list[Participant]]¶
Return participants.
- Parameters:
by_file – If True, group participants by file.
- Returns:
Participants, optionally grouped by file.
- pop() CHAT¶
Remove and return the last file as a new CHAT reader.
- Returns:
A new CHAT reader containing the removed file.
- Raises:
IndexError – If the reader is empty.
- pop_left() CHAT¶
Remove and return the first file as a new CHAT reader.
- Returns:
A new CHAT reader containing the removed file.
- Raises:
IndexError – If the reader is empty.
- tail(n: int = 5) Utterances¶
Return the last n utterances with a formatted display.
- Parameters:
n – Number of utterances to include.
- Returns:
An Utterances object that displays as formatted text.
- to_chat(path: str | os.PathLike[str], *, is_dir: bool = False, filenames: Sequence[str] | None = None) None¶
Write CHAT data to disk.
- Parameters:
path – Output file path (or directory if is_dir is True).
is_dir – If True, write multiple files to the directory.
filenames – Custom filenames when writing to a directory. If None, uses 0001.cha, 0002.cha, etc.
- Raises:
ValueError – If the reader has multiple files but is_dir is False, or if filenames count doesn’t match file count.
IOError – If writing fails.
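The default naming scheme and the filename-count check can be pictured with a short sketch. The helper resolve_filenames is hypothetical and only mirrors the behavior described above; it is not taken from pylangacq's source.

```python
def resolve_filenames(n_files, filenames=None):
    """Return one output filename per file, using zero-padded defaults."""
    if filenames is None:
        # 0001.cha, 0002.cha, ... as described for the default case.
        return [f"{i:04d}.cha" for i in range(1, n_files + 1)]
    if len(filenames) != n_files:
        raise ValueError(
            f"got {len(filenames)} filenames for {n_files} files"
        )
    return list(filenames)


print(resolve_filenames(3))  # ['0001.cha', '0002.cha', '0003.cha']
```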
- to_strs() list[str]¶
Return CHAT data strings, one per file.
- Returns:
A list of CHAT-formatted strings.
- tokens(*, by_utterance: bool = False, by_file: bool = False) list[Token] | list[list[Token]] | list[list[list[Token]]]¶
Return tokens.
- Parameters:
by_utterance – If True, group tokens by utterance.
by_file – If True, group tokens by file.
- Returns:
Tokens with optional grouping.
- ttr(*, participant: str = 'CHI', n: int | None = 350) list[float]¶
Type-token ratio for non-punctuation words.
- Parameters:
participant – Target participant code.
n – Number of tokens to use per file. None for all.
- Returns:
One value per file.
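For readers unfamiliar with the measure, a type-token ratio is the number of distinct words divided by the total number of words in a sample. The sketch below is a plain-Python illustration over the first n words; the function type_token_ratio and the PUNCTUATION set are hypothetical, and the library's own definition of punctuation and its treatment of case may differ.

```python
# Assumption: a small illustrative punctuation set, not the library's.
PUNCTUATION = {".", "?", "!", ","}


def type_token_ratio(words, n=350):
    """Distinct words divided by total words, over the first n words."""
    kept = [w for w in words if w not in PUNCTUATION]
    sample = kept if n is None else kept[:n]
    if not sample:
        return 0.0  # assumption: an empty sample yields 0.0
    return len(set(sample)) / len(sample)


words = ["the", "dog", "saw", "the", "cat", "."]
print(type_token_ratio(words))  # 0.8 (4 types over 5 tokens)
```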
- utterances(*, by_file: bool = False) list[Utterance] | list[list[Utterance]]¶
Return utterances.
- Parameters:
by_file – If True, group utterances by file.
- Returns:
Utterances, optionally grouped by file.
- word_ngrams(n: int) Ngrams¶
Return an Ngrams for word n-grams across all utterances.
N-grams do not cross utterance boundaries.
- Parameters:
n – The n-gram order (1 for unigrams, 2 for bigrams, etc.).
- Returns:
An Ngrams with the accumulated counts.
- Raises:
ValueError – If n < 1.
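The rule that n-grams do not cross utterance boundaries can be shown with collections.Counter. This is a sketch of the counting idea only, using a hypothetical count_word_ngrams function over plain lists of words; it does not use or reproduce the Ngrams class.

```python
from collections import Counter


def count_word_ngrams(utterances, n):
    """Count word n-grams, restarting the window at every utterance."""
    if n < 1:
        raise ValueError("n must be at least 1")
    counts = Counter()
    for words in utterances:
        # The window slides inside one utterance only, so the last word
        # of one utterance never pairs with the first word of the next.
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts


utterances = [["I", "want", "cookie"], ["want", "cookie"]]
bigrams = count_word_ngrams(utterances, 2)
print(bigrams[("want", "cookie")])  # 2
print(bigrams[("cookie", "want")])  # 0, since it would span two utterances
```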
- class pylangacq.ChangeableHeader¶
A changeable header that can appear mid-file in CHAT transcripts.
Variants: Activities, Bck, Bg, Blank, Comment, Date, Eg, G, NewEpisode, Page, Situation.
- class pylangacq.Gra(dep, head, rel)¶
A grammatical relation from the %gra tier.
- class pylangacq.Headers¶
File-level headers from a CHAT file.
- participants: list[Participant]¶
Participants from @Participants and @ID.
- class pylangacq.Ngrams(n: int, *, min_n: int | None = None) None¶
A counter for storing n-grams efficiently and counting their frequencies.
Accumulates n-gram counts from sequences of elements. N-grams do not cross sequence boundaries.
- count(seq: Sequence[str]) None¶
Count n-grams from a single sequence.
- Parameters:
seq – A sequence of elements to extract n-grams from.
- count_seqs(seqs: Sequence[Sequence[str]]) None¶
Count n-grams from multiple sequences.
- Parameters:
seqs – An iterable of sequences.
- get(ngram: Sequence[str]) int¶
Return the count for a specific n-gram.
- Parameters:
ngram – The n-gram to look up.
- Returns:
The count, or 0 if not observed.
- items(*, order: int | None = None) list[tuple[tuple[str, ...], int]]¶
Return all (n-gram, count) pairs.
- Parameters:
order – If specified, only return n-grams of this specific order. Must be between min_n and n (inclusive).
- Returns:
A list of (ngram_tuple, count) pairs.
- Raises:
ValueError – If order is out of range.
- min_n¶
The minimum n-gram order.
- most_common(n: int | None = None, *, order: int | None = None) list[tuple[tuple[str, ...], int]]¶
Return the n most common n-grams with their counts.
- Parameters:
n – Number of top entries to return. If None, returns all n-grams sorted by count (descending).
order – If specified, only return n-grams of this specific order. Must be between min_n and n (inclusive).
- Returns:
A list of (ngram_tuple, count) pairs sorted by count.
- Raises:
ValueError – If order is out of range.
- n¶
The n-gram order.
- to_counter(*, order: int | None = None) Counter[tuple[str, ...]]¶
Convert to a collections.Counter.
- Parameters:
order – If specified, only include n-grams of this specific order. Must be between min_n and n (inclusive). If None, defaults to the highest order (n).
- Returns:
A Counter mapping n-gram tuples to their counts.
- Raises:
ValueError – If order is out of range.
- total(*, order: int | None = None) int¶
Return the total number of n-gram tokens counted.
- Parameters:
order – If specified, return total for this specific order only. Must be between min_n and n (inclusive). If None, returns the sum across all orders.
- Returns:
Total count.
- Raises:
ValueError – If order is out of range.
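The relationship between min_n, n, and the order argument can be pictured as one counter per order. The sketch below is an illustration under assumptions, not the class's implementation: count_orders is a hypothetical function, and treating min_n=None as "same as n" is a guess, since the default's meaning is not stated above.

```python
from collections import Counter


def count_orders(seqs, n, min_n=None):
    """Count n-grams of every order from min_n to n, one Counter per order."""
    low = n if min_n is None else min_n  # assumption: None means min_n == n
    if not 1 <= low <= n:
        raise ValueError("expected 1 <= min_n <= n")
    by_order = {order: Counter() for order in range(low, n + 1)}
    for seq in seqs:
        for order, counts in by_order.items():
            # Windows never extend past the end of the current sequence.
            for i in range(len(seq) - order + 1):
                counts[tuple(seq[i:i + order])] += 1
    return by_order


by_order = count_orders([["a", "b", "a"], ["b"]], n=2, min_n=1)
print(sum(by_order[1].values()))  # 4 unigram tokens
print(sum(by_order[2].values()))  # 2 bigram tokens
print(sum(sum(c.values()) for c in by_order.values()))  # 6 across both orders
```

In this picture, a total for one order corresponds to summing a single counter, while a total with no order given corresponds to summing across all of them.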
- class pylangacq.Participant¶
A participant from @Participants and @ID headers.
- class pylangacq.Token(word, pos=None, mor=None, gra=None)¶
A token with word, POS, morphology, and grammatical relation.
- class pylangacq.Utterance(*, participant=None, tokens=None, time_marks=None, tiers=None, changeable_header=None)¶
A single utterance from a CHAT transcript.
For changeable headers (e.g., @Comment, @New Episode), only changeable_header is set; all other fields are None.
- tiers: dict[str, str] | None¶
Raw tier data including the main tier and dependent tiers, or None for headers.
- changeable_header: ChangeableHeader | None¶
The header variant if this is a changeable header, or None for real utterances.
- class pylangacq.Utterances¶
A sequence of utterances with formatted display.
Returned by CHAT.head() and CHAT.tail(). Displays as column-aligned plain text in the terminal and as HTML tables in Jupyter notebooks.