mltype package¶

Submodules¶

mltype.base module¶

Building blocks.

class mltype.base.Action(pressed_key, status, ts)[source]¶

Bases: object

Representation of one keypress.

Parameters

pressed_key (str) – What key was pressed. We define a convention that pressing a backspace will be represented as pressed_key=None.
status (int) –
What was the status AFTER pushing the key. It should be one of the following integers:
- STATUS_BACKSPACE
- STATUS_CORRECT
- STATUS_WRONG
ts (datetime) – The timestamp corresponding to this action.

class mltype.base.TypedText(text)[source]¶

Bases: object

Abstraction that represents the text that needs to be typed.

Parameters: text (str) – Text that needs to be typed.

actions¶

List of lists of Action instances of length equal to len(text). It logs per character all actions that have been taken on it.

Type: list

start_ts¶

Timestamp of when the first action was performed (not the time of initialization).

Type: datetime or None

end_ts¶

Timestamp of when the last action was taken. Note that it is the action that led to the text being correctly typed in its entirety.

Type: datetime or None

check_finished(force_perfect=True)[source]¶

Determine whether the typing has been finished successfully.

Parameters: force_perfect (bool) – If True, typing is finished only once all the characters have been typed correctly. Otherwise, it is enough that every character is either correct or wrong.
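
The rule above can be sketched as follows. This is a minimal illustration rather than the library's implementation: the helper name is_finished and the integer values of the status constants are assumptions, and each character is reduced to the status of its latest action (None for an untouched character).

```python
# Hypothetical values; the real constants are defined in mltype.base.
STATUS_BACKSPACE, STATUS_CORRECT, STATUS_WRONG = 1, 2, 3


def is_finished(latest_statuses, force_perfect=True):
    """Decide whether typing is complete.

    latest_statuses holds one entry per character of the text: the status
    of the most recent action on it, or None if it was never touched.
    """
    if force_perfect:
        allowed = {STATUS_CORRECT}
    else:
        allowed = {STATUS_CORRECT, STATUS_WRONG}
    return all(status in allowed for status in latest_statuses)


statuses = [STATUS_CORRECT, STATUS_WRONG]
print(is_finished(statuses, force_perfect=True))   # False
print(is_finished(statuses, force_perfect=False))  # True
```

In this sketch a backspaced or untouched character blocks completion in both modes.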

compute_accuracy()[source]¶: Compute the accuracy of the typing.

compute_cpm()[source]¶: Compute characters per minute.

compute_wpm(word_size=5)[source]¶: Compute words per minute.
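
The relation between the two speed metrics can be sketched like this. It assumes the common typing-test convention that a "word" is word_size characters and that only correctly typed characters are counted; the function names and the choice of numerator are assumptions, since the source does not spell out the formula.

```python
def cpm(n_correct_characters, elapsed_seconds):
    """Characters per minute from a character count and a duration."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return n_correct_characters * 60 / elapsed_seconds


def wpm(n_correct_characters, elapsed_seconds, word_size=5):
    """Words per minute, treating every word_size characters as one word."""
    return cpm(n_correct_characters, elapsed_seconds) / word_size


# 150 characters typed in 30 seconds.
print(cpm(150, 30))  # 300.0
print(wpm(150, 30))  # 60.0
```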

property elapsed_seconds¶: Get the number of seconds elapsed from the first action.

classmethod load(path)[source]¶

Load a pickled file.

Parameters: path (pathlib.Path) – Path to the pickle file.
Returns: typed_text – Instance of the TypedText
Return type: TypedText

property n_actions¶: Get the number of actions that have been taken.

property n_backspace_actions¶: Get the number of backspace actions.

property n_backspace_characters¶: Get the number of characters that have been backspaced.

property n_characters¶: Get the number of characters in the text.

property n_correct_characters¶: Get the number of characters that have been typed correctly.

property n_untouched_characters¶: Get the number of characters that have not been touched yet.

property n_wrong_characters¶: Get the number of characters that have been typed wrongly.

save(path)[source]¶

Save internal state of this TypedText.

Can be loaded via the class method load.

Parameters: path (pathlib.Path) – Where the .rlt file will be stored.

type_character(i, ch=None)[source]¶

Type one single character.

Parameters

i (int) – Index of the character in the text.
ch (str or None) – The character that was typed. Note that if None then we assume that the user used backspace.

unroll_actions()[source]¶

Export actions in the order they appeared.

Returns: res – List of tuples of (ix_char, Action(..))
Return type: list
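
One way to picture the unrolling is to flatten the per-character logs into (ix_char, action) pairs and order them chronologically. The sketch below uses a stand-in namedtuple instead of the real Action class, hypothetical status values, and assumes that sorting by ts recovers the order in which the actions appeared.

```python
from collections import namedtuple
from datetime import datetime, timedelta

# Stand-in for mltype.base.Action with the same three fields.
Action = namedtuple("Action", ["pressed_key", "status", "ts"])

# Hypothetical values; the real constants are defined in mltype.base.
STATUS_BACKSPACE, STATUS_CORRECT, STATUS_WRONG = 1, 2, 3


def unroll(actions):
    """Flatten per-character action logs into chronological order."""
    pairs = [
        (ix_char, action)
        for ix_char, char_actions in enumerate(actions)
        for action in char_actions
    ]
    return sorted(pairs, key=lambda pair: pair[1].ts)


t0 = datetime(2021, 1, 1)
log = [
    # Character 0: typed wrongly, backspaced, then typed correctly.
    [
        Action("x", STATUS_WRONG, t0),
        Action(None, STATUS_BACKSPACE, t0 + timedelta(seconds=1)),
        Action("a", STATUS_CORRECT, t0 + timedelta(seconds=2)),
    ],
    # Character 1: typed correctly on the first attempt.
    [Action("b", STATUS_CORRECT, t0 + timedelta(seconds=3))],
]
for ix_char, action in unroll(log):
    print(ix_char, action.pressed_key, action.status)
```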

mltype.cli module¶

Command line interface.

mltype.data module¶

Data creating and managing.

mltype.data.file2text(filepath, verbose=True)[source]¶

Read all lines of a file into a string.

Note that we destroy all the newline characters and all the whitespace characters on both ends of each line. This is very radical for source code of programming languages or similar, since indentation is lost.

Parameters

filepath (pathlib.Path) – Path to the file
verbose (bool) – If True, we print the name of the file.

Returns

text – All the text found in the input file.

Return type

str
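
The effect of this stripping on source code can be seen in a small sketch. The helper name lines_to_text is hypothetical, and both joining the stripped lines with a single space and dropping empty lines are assumptions, not behavior confirmed by the source.

```python
import tempfile
from pathlib import Path


def lines_to_text(filepath):
    """Strip each line and join the pieces into a single string."""
    with Path(filepath).open() as f:
        stripped = [line.strip() for line in f]
    # Joining with a space and dropping empty lines are assumptions.
    return " ".join(piece for piece in stripped if piece)


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.py"
    path.write_text("def f(x):\n    return x + 1\n")
    print(lines_to_text(path))  # def f(x): return x + 1
```

The indentation of the function body disappears, which is why the note calls the approach radical for source code.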

mltype.data.folder2text(folderpath, valid_extensions=None)[source]¶: Collect all files recursively and read into a list of strings.

mltype.interactive module¶

Module implementing interaction logic.

class mltype.interactive.Cursor(stdscr)[source]¶

Bases: object

Utility class that can locate and modify the position of a cursor.

move_abs(y, x)[source]¶

Move absolutely to coordinates.

Note that if the column coordinate x is off the screen, we automatically move to a different row.

Parameters: y, x (int) – New coordinates to move the cursor to.
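
The row wrapping described above can be sketched with divmod. The function name wrap_coordinates and the explicit width argument are assumptions made for illustration; the real method works on a curses window and may handle edge cases differently.

```python
def wrap_coordinates(y, x, width):
    """Map a (y, x) request onto a screen that is `width` columns wide."""
    if width <= 0:
        raise ValueError("width must be positive")
    # Every full screen width in x pushes the cursor one row down.
    extra_rows, new_x = divmod(x, width)
    return y + extra_rows, new_x


print(wrap_coordinates(0, 85, width=80))  # (1, 5)
```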

class mltype.interactive.Pen(font, background, i)[source]¶

Bases: object

Represents background and font color.

addch(stdscr, y, x, text)[source]¶

Add a single character.

Parameters

stdscr (curses.Window) – Window in which we add the character.
y (int) – Position of the character.
x (int) – Position of the character.
text (str) – Single element string representing the character.

addstr(stdscr, y, x, text)[source]¶

Add a string.

Parameters

stdscr (curses.Window) – Window in which we add the character.
y (int) – Position of the string.
x (int) – Position of the string.
text (str) – String to put to the screen.

class mltype.interactive.TypedTextWriter(tt, stdscr, y_start=0, x_start=0, replay_tt=None, target_wpm=None)[source]¶

Bases: object

Curses writer that uses the TypedText object.

We make an assumption that the x and y position of the starting character stay the same.

Parameters

tt (TypedText) – Text that the user is going to type.
stdscr (curses.Window) – Main curses window.
y_start (int) – Coordinates of the first character.
x_start (int) – Coordinates of the first character.
replay_tt (TypedText or None) – If provided, it represents a previously typed text that we want to dynamically plot together with the current typing.

current_ix¶

Represents the index of the character of self.tt.text that we are about to type. Note this is exactly the character on which the cursor will be lying.

Type: int

pens¶

The keys are integers representing different statuses. The values are Pen objects representing how to format a character with a given status. Note that if replay_tt is provided we add a new entry “replay” and it represents the style of replay character.

Type: dict

replay_uactions¶

The unrolled actions of the replay.

Type: list

replay_elapsed¶

The same length as replay_uactions. It stores the elapsed times (since the start) of all the actions. Note that it is going to be sorted in an ascending order and we can do binary search on it.

Type: list
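
Because the elapsed times are sorted, the number of replay actions that should already be visible at a given moment can be found with a binary search. A minimal sketch using the standard bisect module; the helper name and the sample timings are made up for illustration.

```python
from bisect import bisect_right


def n_replay_actions_done(replay_elapsed, now_elapsed):
    """Count replay actions that happened at or before now_elapsed seconds."""
    return bisect_right(replay_elapsed, now_elapsed)


replay_elapsed = [0.0, 0.4, 0.9, 1.5, 2.2]
print(n_replay_actions_done(replay_elapsed, 1.0))  # 3
```

The lookup costs O(log n) per call, which matters when it runs on every screen refresh.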

target_wpm¶

If specified, we display the uniform run that leads to that speed.

Type: int or None
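
A uniform run at a given speed can be pictured as a marker that advances at a constant number of characters per second. The sketch below assumes a word size of 5 characters and clamps the marker at the end of the text; the function name and both of these choices are assumptions.

```python
def target_position(target_wpm, elapsed_seconds, n_characters, word_size=5):
    """Index the uniform-pace marker would have reached after elapsed_seconds."""
    chars_per_second = target_wpm * word_size / 60
    return min(int(chars_per_second * elapsed_seconds), n_characters)


# 60 wpm with 5-character words is 5 characters per second.
print(target_position(60, 2.5, n_characters=100))  # 12
```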

process_character()[source]¶: Process an entered character.

render()[source]¶: Render the entire screen.

property screen_status¶

Get screen information.

Returns

i_start (int) – Integer representing the number of cells away from the start we are.
height, width (int) – Height, width of the screen. Note that the user may resize it during a session.

mltype.interactive.main_basic(text, force_perfect, output_file, instant_death, target_wpm)[source]¶

Run main curses loop with no previous replay.

Parameters

force_perfect (bool) – If True, then one cannot finish typing before all characters are typed without any mistakes.
output_file (str or pathlib.Path or None) – If pathlib.Path then we store the typed text in this file. If None, no saving is taking place.
instant_death (bool) – If active, the first mistake will end the game.
target_wpm (int or None) – The desired speed to be displayed as a guide.

mltype.interactive.main_replay(replay_file, force_perfect, overwrite, instant_death, target_wpm)[source]¶

Run main curses loop with a replay.

Parameters

replay_file (str or pathlib.Path) – File containing the typed text from some previous game.
force_perfect (bool) – If True, then one cannot finish typing before all characters are typed without any mistakes.
overwrite (bool) – If True, the replay file will be overwritten in case we are faster than it.
instant_death (bool) – If active, the first mistake will end the game.
target_wpm (int or None) – The desired speed to be shown as a guide.

mltype.interactive.run_loop(stdscr, text, force_perfect=True, replay_tt=None, instant_death=False, target_wpm=None)[source]¶: Run curses loop - actual implementation.

mltype.ml module¶

Machine learning utilities.

class mltype.ml.LanguageDataset(X, y, vocabulary, transform=None)[source]¶

Bases: torch.utils.data.dataset.Dataset

Language dataset.

All the inputs of this class should be generated via create_data_language.

Parameters

X (np.ndarray) – Array of shape (n_samples, window_size) of dtype np.int8. It represents the features.
y (np.ndarray) – Array of shape (n_samples,) of dtype np.int8. It represents the targets
vocabulary (list) – List of characters in the vocabulary.
transform (callable or None) – Some callable that inputs X and y and returns some modified instances of them.

ohv_matrix¶

Matrix of shape (vocab_size + 1, vocab_size). The submatrix ohv_matrix[:vocab_size, :] is an identity matrix and is used for fast creation of one hot vectors. The last row of ohv_matrix is a zero vector and is used for encoding out-of-vocabulary characters.

Type: np.ndarray
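
The layout of this matrix, and why it makes one hot encoding a simple row lookup, can be sketched without any dependencies. Plain nested lists stand in for the np.ndarray here, and the helper names are hypothetical.

```python
def build_ohv_matrix(vocab_size):
    """Identity rows for known characters plus one all-zero row for OOV."""
    identity = [
        [1 if col == row else 0 for col in range(vocab_size)]
        for row in range(vocab_size)
    ]
    return identity + [[0] * vocab_size]


def one_hot(window, ohv_matrix):
    """Turn a window of integer codes into one hot vectors by row lookup."""
    return [ohv_matrix[code] for code in window]


vocabulary = ["a", "b", "c"]
ohv = build_ohv_matrix(len(vocabulary))
# Code 3 == len(vocabulary) marks an out-of-vocabulary character.
print(one_hot([0, 3, 2], ohv))  # [[1, 0, 0], [0, 0, 0], [0, 0, 1]]
```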

class mltype.ml.SingleCharacterLSTM(vocab_size, hidden_size=16, n_layers=1, dense_size=128)[source]¶

Bases: pytorch_lightning.core.lightning.LightningModule

Single character recurrent neural network.

Given some string of characters, we generate the probability distribution of the next character.

Architecture starts with an LSTM (hidden_size, n_layers, vocab_size) network and then we feed the last hidden state to a fully connected network with one hidden layer (dense_size).

Parameters

vocab_size (int) – Size of the vocabulary. Necessary since we are encoding each character as a one hot vector.
hidden_size (int) – Hidden size of the recurrent cell.
n_layers (int) – Number of layers in the recurrent network.
dense_size (int) – Size of the single layer of the feed forward network.

rnn_layer¶

The recurrent network layer.

Type: torch.nn.Module

linear_layer1¶

Linear layer connecting the last hidden state and the single layer of the feedforward network.

Type: torch.nn.Module

linear_layer2¶

Linear layer connecting the single layer of the feedforward network with the output (of size vocabulary_size).

Type: torch.nn.Module

activation_layer¶

Softmax layer making sure we get a probability distribution.

Type: torch.nn.Module

configure_optimizers()[source]¶

Configure optimizers.

Necessary for pytorch-lightning.

Returns: optimizer – The chosen optimizer.
Return type: Optimizer

forward(x, h=None, c=None)[source]¶

Perform forward pass.

Parameters

x (torch.Tensor) – Input features of shape (batch_size, window_size, vocab_size). Note that the provided vocab_size needs to be equal to the one provided in the constructor. The remaining dimensions (batch_size and window_size) can be any positive integers.
h (torch.Tensor) – Hidden states of shape (n_layers, batch_size, hidden_size). Note that if provided we enter a continuation mode. In this case to generate the prediction we just use the last character and the hidden state for the prediction. Note that in this case we enforce that x.shape=(batch_size, 1, vocab_size).
c (torch.Tensor) – Cell states of shape (n_layers, batch_size, hidden_size). Note that if provided we enter a continuation mode. In this case to generate the prediction we just use the last character and the hidden state for the prediction. Note that in this case we enforce that x.shape=(batch_size, 1, vocab_size).

Returns

probs (torch.Tensor) – Tensor of shape (batch_size, vocab_size). For each sample it represents the probability distribution over all characters in the vocabulary.
h_n, c_n (torch.Tensor) – New hidden states of shape (n_layers, batch_size, hidden_size).

training: bool¶

training_step(batch, batch_idx)[source]¶

Run training step.

Necessary for pytorch-lightning.

Parameters

batch (tuple) – Batch of training samples. The exact definition depends on the dataloader.
batch_idx (int) – Index of the batch.

Returns

loss – Tensor scalar representing the mean binary cross entropy over the batch.

Return type

torch.Tensor

validation_epoch_end(outputs)[source]¶

Run epoch end validation logic.

We sample 5 times 100 characters from the current network. We then print to the standard output.

Parameters: outputs (list) – List of batches that were collected over the validation set with validation_step.

validation_step(batch, batch_idx)[source]¶

Run validation step.

Optional for pytorch-lightning.

Parameters

batch (tuple) – Batch of validation samples. The exact definition depends on the dataloader.
batch_idx (int) – Index of the batch.

Returns: vocabulary – Vocabulary in order to have access in validation_epoch_end.
Return type: list

mltype.ml.create_data_language(text, vocabulary, window_size=2, fill_strategy='zeros', verbose=False)[source]¶

Create a supervised dataset for the character-level language model.

Parameters

text (str) – Some text.
vocabulary (list) – Unique list of supported characters. Their corresponding indices are going to be used for the one hot encoding.
window_size (int) – The number of previous characters to condition on.
fill_strategy (str, {"skip", "zeros"}) – Strategy for handling initial characters and unknown characters.
verbose (bool) – If True, a progress bar is shown.

Returns

X (np.ndarray) – Features array of shape (len(text), window_size) if fill_strategy=zeros, otherwise it might be shorter. The dtype is np.int8. If applicable, the integer (len(vocabulary)) represents a zero vector (out of vocabulary token).
y (np.ndarray) – Targets array of shape (len(text),) if fill_strategy=zeros, otherwise it might be shorter. The dtype is np.int8.
indices (np.ndarray) – For each sample, the index of the character we are trying to predict. Note that for fill_strategy=”zeros” it is going to be np.arange(len(text)). For other strategies, however, it might have gaps. It helps us keep track of the sample-to-character correspondence.
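
The difference between the two fill strategies is easiest to see on a tiny input. The sketch below is a dependency-free approximation that returns plain lists instead of np.int8 arrays. The function name make_windows is hypothetical, and two details are assumptions: missing context at the start of the text is padded with the out-of-vocabulary code, and an unknown target under "zeros" is also encoded with that code.

```python
def make_windows(text, vocabulary, window_size=2, fill_strategy="zeros"):
    """Build (X, y, indices) for next-character prediction."""
    if fill_strategy not in {"zeros", "skip"}:
        raise ValueError(f"Unknown fill strategy: {fill_strategy}")

    char_to_ix = {ch: i for i, ch in enumerate(vocabulary)}
    oov = len(vocabulary)  # code that stands for a zero vector
    encoded = [char_to_ix.get(ch, oov) for ch in text]

    X, y, indices = [], [], []
    for i, target in enumerate(encoded):
        start = i - window_size
        # Pad on the left when fewer than window_size characters precede i.
        window = [oov] * max(0, -start) + encoded[max(0, start):i]
        if fill_strategy == "skip" and (oov in window or target == oov):
            continue
        X.append(window)
        y.append(target)
        indices.append(i)
    return X, y, indices


print(make_windows("abab", ["a", "b"], fill_strategy="zeros"))
print(make_windows("abab", ["a", "b"], fill_strategy="skip"))
```

With "zeros" every character yields a sample, so indices is 0, 1, 2, 3. With "skip" the first two samples are dropped because their windows are incomplete, leaving indices 2 and 3.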

mltype.ml.load_model(path)[source]¶

Load serialized model and vocabulary.

Parameters

path (pathlib.Path) – Path to where the file lies. This file was created by save_model method.

Returns

model_inst (SingleCharacterLSTM) – Instance of the model. Note that all of its parameters will be lying on a CPU.
vocabulary (list) – Corresponding vocabulary.

mltype.ml.run_train(texts, name, max_epochs=10, window_size=50, batch_size=32, vocab_size=None, fill_strategy='skip', illegal_chars='', train_test_split=0.5, hidden_size=32, dense_size=32, n_layers=1, checkpoint_path=None, output_path=None, use_mlflow=True, early_stopping=True, gpus=None)[source]¶

Run the training loop.

Note that the parameters are also explained in the cli of mlt train.

Parameters

texts (list) – List of str representing all texts we would like to train on.
name (str) – Name of the model. This name is only used when we save the model - it is not hardcoded anywhere in the serialization.
max_epochs (int) – Maximum number of epochs. Note that the number of actual epochs can be lower if we activate the early_stopping flag.
window_size (int) – Number of previous characters to consider when predicting the next character. The higher the number the longer the memory we are enforcing. However, at the same time, the training becomes slower.
batch_size (int) – Number of samples in one batch.
vocab_size (int) – Maximum number of characters to be put in the vocabulary. Note that one can explicitly exclude characters via illegal_chars. The higher this number the bigger the feature vectors are and the slower the training.
fill_strategy (str, {"zeros", "skip"}) – Determines how to deal with out of vocabulary characters. When “zeros” then we simply encode them as zero vectors. If “skip”, we skip a given sample if any of the characters in the window or the predicted character are not in the vocabulary.
illegal_chars (str or None) – If specified, then each character of the str represents a forbidden character that we do not put in the vocabulary.
train_test_split (float) – Float in the range (0, 1) representing the percentage of the training set with respect to the entire dataset.
hidden_size (int) – Hidden size of LSTM cells (equal in all layers).
dense_size (int) – Size of the dense layer that is bridging the hidden state outputted by the LSTM and the final output probabilities over the vocabulary.
n_layers (int) – Number of layers inside of the LSTM.
checkpoint_path (None or pathlib.Path or str) – If specified, it is pointing to a checkpoint file (generated by Pytorch-lightning). This file does not contain the vocabulary. It can be used to continue the training.
output_path (None or pathlib.Path or str) – If specified, it is an alternative output folder where the trained models and logging information will be stored. If not specified the output folder is by default set to ~/.mltype.
use_mlflow (bool) – If active, then we use mlflow for logging of training and validation loss. Additionally, at the end of each epoch we generate a few sample texts to demonstrate how good/bad the current network is.
early_stopping (bool) – If True, then we monitor the validation loss and if it does not improve for a certain number of epochs then we stop the training.
gpus (int or None) – If None or 0, no GPUs are used (only CPUs). Otherwise, it represents the number of GPUs to be used (using the data parallelization strategy).

mltype.ml.sample_char(network, vocabulary, h=None, c=None, previous_chars=None, random_state=None, top_k=None, device=None)[source]¶

Sample a character given network probability prediction (with a state).

Parameters

network (torch.nn.Module) – Trained neural network that outputs a probability distribution over vocabulary.
vocabulary (list) – List of unique characters.
h (torch.Tensor) – Hidden states with shape (n_layers, batch_size=1, hidden_size). Note that if both of them are None we are at the very first character.
c (torch.Tensor) – Cell states with shape (n_layers, batch_size=1, hidden_size). Note that if both of them are None we are at the very first character.
previous_chars (None or str) – Previous characters. None or an empty string if we are at the very first character.
random_state (None or int) – Guarantees reproducibility.
top_k (None or int) – If specified, we only sample from the top k most probable characters. Otherwise all of them.
device (None or torch.device) – By default torch.device(“cpu”).

Returns

ch – A character from the vocabulary.

Return type

str
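
The role of top_k and random_state can be sketched independently of the network. The helper below takes an already computed probability distribution as a plain list; its name and the use of the standard random module are assumptions, and the real function additionally runs the network and tracks its state.

```python
import random


def sample_top_k(probs, vocabulary, top_k=None, random_state=None):
    """Sample one character, optionally restricted to the top_k most probable."""
    rng = random.Random(random_state)
    # Indices of the vocabulary ordered from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = ranked if top_k is None else ranked[:top_k]
    # random.choices accepts relative weights, so no renormalization is needed.
    weights = [probs[i] for i in kept]
    chosen = rng.choices(kept, weights=weights, k=1)[0]
    return vocabulary[chosen]


vocabulary = ["a", "b", "c"]
probs = [0.1, 0.7, 0.2]
print(sample_top_k(probs, vocabulary, top_k=1))  # b
print(sample_top_k(probs, vocabulary, top_k=2, random_state=42))
```

With top_k=1 the sampling becomes greedy; larger values trade determinism for variety.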

mltype.ml.sample_text(n_chars, network, vocabulary, initial_text=None, random_state=None, top_k=None, verbose=False, device=None)[source]¶

Sample text by unrolling character by character predictions.

Note that we pass the hidden states along with each character prediction, so there is no need to specify a window.

Parameters

n_chars (int) – Number of characters to sample.
network (torch.nn.Module) – Pretrained character level network.
vocabulary (list) – List of unique characters.
initial_text (None or str) – If specified, initial text to condition based on.
random_state (None or int) – Allows reproducibility.
top_k (None or int) – If specified, we only sample from the top k most probable characters. Otherwise all of them.
verbose (bool) – Controls verbosity.
device (None or torch.device) – By default torch.device(“cpu”).

Returns

text – Generated text of length n_chars + len(initial_text).

Return type

str

mltype.ml.save_model(model, vocabulary, path)[source]¶

Serialize a model.

Note that we require that the model has a property hparams that we can unpack into the constructor of the class and get the same network architecture. This is automatically the case if we subclass from pl.LightningModule.

Parameters

model (SingleCharacterLSTM) – Torch model to be saved. Additionally, we require that it has the hparams property that contains all necessary hyperparameters to instantiate the model.
vocabulary (list) – The corresponding vocabulary.
path (pathlib.Path) – Path to the file that will hold the serialized object.

mltype.ml.text2features(text, vocabulary)[source]¶

Create per character one hot encoding.

Note that we employ the zeros strategy for out of vocabulary characters.

Parameters

text (str) – Text.
vocabulary (list) – Vocabulary to be used for the encoding.

Returns

res – Array of shape (len(text), len(vocabulary)) of boolean dtype. Each row represents the one hot encoding of the respective character. Note that out of vocabulary characters are encoded as a zero vector.

Return type

np.ndarray

mltype.stats module¶

Computation of various statistics.

mltype.stats.times_per_character(tt)[source]¶

Compute per character analysis.

Parameters: tt (TypedText) – Instance of the TypedText.
Returns: stats – Keys are characters and values are lists of time intervals it took to write the last correct instance.
Return type: dict

mltype.utils module¶

Collection of utility functions.

mltype.utils.get_cache_dir(predefined_path=None)[source]¶

Get the cache directory path and potentially create it.

If no predefined path is provided, we simply take ~/.mltype. Note that if one changes the os.environ[“home”] dynamically, it will influence the output of this function. This is done on purpose to simplify testing.

Parameters: predefined_path (None or pathlib.Path or str) – If provided, we just return the same path. We potentially create the directory if it does not exist. If it is not provided we use $HOME/.mltype.
Returns: path – Path to where the caching directory is located.
Return type: pathlib.Path
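
The lookup logic described here can be sketched with pathlib. The function name cache_dir is hypothetical, and reading os.environ["HOME"] at call time is an assumption based on the $HOME/.mltype default mentioned above; the real function may resolve the home directory differently.

```python
import os
import tempfile
from pathlib import Path


def cache_dir(predefined_path=None):
    """Return the cache directory, creating it if it does not exist."""
    if predefined_path is not None:
        path = Path(predefined_path)
    else:
        # Read the environment at call time so that tests can redirect it.
        path = Path(os.environ["HOME"]) / ".mltype"
    path.mkdir(parents=True, exist_ok=True)
    return path


with tempfile.TemporaryDirectory() as tmp:
    print(cache_dir(Path(tmp) / "custom").is_dir())  # True
```

Resolving the home directory on every call, rather than once at import time, is what lets a test point the cache at a temporary folder.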

mltype.utils.get_mlflow_artifacts_path(client, run_id)[source]¶

Get path to where the artifacts are located.

The benefit is that we can log any file into it and even create a custom folder hierarchy.

Parameters

client (mlflow.tracking.MlflowClient) – Client.
run_id (str) – Unique identifier of a run.

Returns

path – Path to the root folder of artifacts.

Return type

pathlib.Path

mltype.utils.print_section(name, fill_char='=', drop_end=False, add_ts=True)[source]¶

Print nice section blocks.

Parameters

name (str) – Name of the section.
fill_char (str) – Character to be used for filling the line.
drop_end (bool) – If True, the ending line is not printed.
add_ts (bool) – If True, we add a timestamp to the heading.

Module contents¶

Python package.