mltype package¶
Submodules¶
mltype.base module¶
Building blocks.
-
class
mltype.base.
Action
(pressed_key, status, ts)[source]¶ Bases:
object
Representation of one keypress.
- Parameters
pressed_key (str) – What key was pressed. We define a convention that pressing a backspace will be represented as pressed_key=None.
status (int) –
What was the status AFTER pushing the key. It should be one of the following integers:
STATUS_BACKSPACE
STATUS_CORRECT
STATUS_WRONG
ts (datetime) – The timestamp corresponding to this action.
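The keypress convention above can be sketched as follows. This is a minimal stand-in, not the library's implementation: the numeric values of the status constants, the `ActionSketch` class and the `make_action` helper are all hypothetical and only illustrate that a backspace is encoded as `pressed_key=None` and that `status` describes the state after the key was pushed.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Hypothetical values: the real constants live in mltype.base and may differ.
STATUS_BACKSPACE = 1
STATUS_CORRECT = 2
STATUS_WRONG = 3


@dataclass
class ActionSketch:
    """Stand-in for mltype.base.Action, for illustration only."""

    pressed_key: Optional[str]  # None encodes a backspace
    status: int  # status AFTER the key was pushed
    ts: datetime


def make_action(pressed_key, expected_char, ts):
    """Build an action by comparing the pressed key with the expected character."""
    if pressed_key is None:
        status = STATUS_BACKSPACE
    elif pressed_key == expected_char:
        status = STATUS_CORRECT
    else:
        status = STATUS_WRONG
    return ActionSketch(pressed_key, status, ts)


now = datetime(2021, 1, 1, 12, 0, 0)
ok = make_action("a", "a", now)
typo = make_action("s", "a", now)
undo = make_action(None, "a", now)
```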
-
class
mltype.base.
TypedText
(text)[source]¶ Bases:
object
Abstraction that represents the text that needs to be typed.
- Parameters
text (str) – Text that needs to be typed.
-
actions
¶ List of lists of Action instances of length equal to len(text). It logs per character all actions that have been taken on it.
- Type
list
-
start_ts
¶ Timestamp of when the first action was performed (not the time of initialization).
- Type
datetime or None
-
end_ts
¶ Timestamp of when the last action was taken. Note that it is the action that led to the text being correctly typed in its entirety.
- Type
datetime or None
-
check_finished
(force_perfect=True)[source]¶ Determine whether the typing has been finished successfully.
- Parameters
force_perfect (bool) – If True, one can only finish once all the characters have been typed correctly. Otherwise, every character needs to be either correct or wrong.
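The effect of force_perfect can be illustrated with a small sketch. It assumes that each character is summarized by the status of its latest action (or None if untouched). The status values and the `check_finished_sketch` helper are hypothetical, not taken from the library.

```python
# Hypothetical status values; the real constants are defined in mltype.base.
STATUS_BACKSPACE, STATUS_CORRECT, STATUS_WRONG = 1, 2, 3


def check_finished_sketch(last_statuses, force_perfect=True):
    """Decide whether typing is finished.

    last_statuses holds the latest status per character, or None if the
    character has not been touched yet.
    """
    if force_perfect:
        # Every character has to be typed correctly.
        return all(s == STATUS_CORRECT for s in last_statuses)
    # Mistakes are tolerated, but every character has to be typed.
    return all(s in (STATUS_CORRECT, STATUS_WRONG) for s in last_statuses)


perfect_run = [STATUS_CORRECT, STATUS_CORRECT]
sloppy_run = [STATUS_CORRECT, STATUS_WRONG]
unfinished_run = [STATUS_CORRECT, None]
```

With these inputs, `sloppy_run` only counts as finished when `force_perfect=False`, while `unfinished_run` is never finished.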
-
property
elapsed_seconds
¶ Get the number of seconds elapsed from the first action.
-
classmethod
load
(path)[source]¶ Load a pickled file.
- Parameters
path (pathlib.Path) – Path to the pickle file.
- Returns
typed_text – Instance of the TypedText.
- Return type
TypedText
-
property
n_actions
¶ Get the number of actions that have been taken.
-
property
n_backspace_actions
¶ Get the number of backspace actions.
-
property
n_backspace_characters
¶ Get the number of characters that have been backspaced.
-
property
n_characters
¶ Get the number of characters in the text.
-
property
n_correct_characters
¶ Get the number of characters that have been typed correctly.
-
property
n_untouched_characters
¶ Get the number of characters that have not been touched yet.
-
property
n_wrong_characters
¶ Get the number of characters that have been typed wrongly.
-
save
(path)[source]¶ Save internal state of this TypedText.
Can be loaded via the class method load.
- Parameters
path (pathlib.Path) – Where the .rlt file will be stored.
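A save/load round trip through pickle might look roughly like the sketch below. `TypedTextSketch` is a hypothetical stand-in, and which attributes the real class serializes is an assumption. Here only the text and the per-character action lists are stored.

```python
import pickle
import tempfile
from pathlib import Path


class TypedTextSketch:
    """Stand-in for TypedText that keeps the text and per-character logs."""

    def __init__(self, text):
        self.text = text
        self.actions = [[] for _ in text]  # one list of actions per character

    def save(self, path):
        # Assumption: the state is pickled as a plain dictionary.
        state = {"text": self.text, "actions": self.actions}
        with Path(path).open("wb") as f:
            pickle.dump(state, f)

    @classmethod
    def load(cls, path):
        with Path(path).open("rb") as f:
            state = pickle.load(f)
        instance = cls(state["text"])
        instance.actions = state["actions"]
        return instance


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "run.rlt"
    TypedTextSketch("hello").save(path)
    restored = TypedTextSketch.load(path)
```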
mltype.cli module¶
Command line interface.
mltype.data module¶
Data creating and managing.
-
mltype.data.
file2text
(filepath, verbose=True)[source]¶ Read all lines of a file into a string.
Note that we remove all newline characters and strip the whitespace characters from both ends of every line. This is very destructive for source code of programming languages or other whitespace-sensitive text.
- Parameters
filepath (pathlib.Path) – Path to the file
verbose (bool) – If True, we print the name of the file.
- Returns
text – All the text found in the input file.
- Return type
str
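The described behavior can be approximated with the standard library. In this sketch, joining the stripped lines with a single space and dropping empty lines are assumptions, and `file2text_sketch` is a hypothetical name.

```python
import tempfile
from pathlib import Path


def file2text_sketch(filepath, verbose=True):
    """Read a file into one string, stripping every line."""
    filepath = Path(filepath)
    if verbose:
        print(filepath.name)
    lines = filepath.read_text().splitlines()
    # Assumption: stripped lines are joined with one space, empty lines dropped.
    return " ".join(line.strip() for line in lines if line.strip())


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "snippet.py"
    source.write_text("def f():\n    return 1\n")
    text = file2text_sketch(source, verbose=False)
```

The indentation of the second line is lost, which is why the result is no longer valid Python.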
mltype.interactive module¶
Module implementing interaction logic.
-
class
mltype.interactive.
Cursor
(stdscr)[source]¶ Bases:
object
Utility class that can locate and modify the position of a cursor.
-
class
mltype.interactive.
Pen
(font, background, i)[source]¶ Bases:
object
Represents background and font color.
-
class
mltype.interactive.
TypedTextWriter
(tt, stdscr, y_start=0, x_start=0, replay_tt=None, target_wpm=None)[source]¶ Bases:
object
Curses writer that uses the TypedText object.
We make an assumption that the x and y position of the starting character stay the same.
- Parameters
tt (TypedText) – Text that the user is going to type.
stdscr (curses.Window) – Main curses window.
y_start (int) – Row (y coordinate) of the first character.
x_start (int) – Column (x coordinate) of the first character.
replay_tt (TypedText or None) – If provided, it represents a previously typed text that we want to dynamically plot together with the current typing.
-
current_ix
¶ Represents the index of the character of self.tt.text that we are about to type. Note this is exactly the character on which the cursor will be lying.
- Type
int
-
pens
¶ The keys are integers representing different statuses. The values are Pen objects representing how to format a character with a given status. Note that if replay_tt is provided we add a new entry “replay” and it represents the style of replay character.
- Type
dict
-
replay_uactions
¶ The unrolled actions of the replay.
- Type
list
-
replay_elapsed
¶ The same length as replay_uactions. It stores the elapsed times (since the start) of all the actions. Note that it is going to be sorted in an ascending order and we can do binary search on it.
- Type
list
-
target_wpm
¶ If specified, we display the uniform run that leads to that speed.
- Type
int or None
-
property
screen_status
¶ Get screen information.
- Returns
i_start (int) – Integer representing the number of cells away from the start we are.
height, width (int) – Height, width of the screen. Note that the user may resize the screen during a session.
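Because replay_elapsed is sorted in ascending order, the number of replay actions that should already be visible at a given moment can be found with a binary search. The following is a sketch using the standard `bisect` module. The sample timings and the helper name are hypothetical, and the real writer may perform the lookup differently.

```python
from bisect import bisect_right

# Hypothetical elapsed times (in seconds since the start) of replay actions.
replay_elapsed = [0.0, 0.4, 0.9, 1.5, 2.2]


def n_replay_actions_done(replay_elapsed, now_elapsed):
    """Count replay actions whose elapsed time is <= now_elapsed."""
    return bisect_right(replay_elapsed, now_elapsed)


done_after_one_second = n_replay_actions_done(replay_elapsed, 1.0)
done_at_the_end = n_replay_actions_done(replay_elapsed, 10.0)
```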
-
mltype.interactive.
main_basic
(text, force_perfect, output_file, instant_death, target_wpm)[source]¶ Run main curses loop with no previous replay.
- Parameters
force_perfect (bool) – If True, then one cannot finish typing before all characters are typed without any mistakes.
output_file (str or pathlib.Path or None) – If pathlib.Path then we store the typed text in this file. If None, no saving takes place.
instant_death (bool) – If active, the first mistake will end the game.
target_wpm (int or None) – The desired speed to be displayed as a guide.
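A uniform run at target_wpm can be turned into a character position as sketched below. The conversion of five characters per word is a common typing convention and is an assumption here, as is the `target_position` helper. The library may compute the guide differently.

```python
def target_position(elapsed_seconds, target_wpm, n_characters, chars_per_word=5):
    """Return the character index a uniform typist would have reached."""
    # Assumption: one word counts as chars_per_word characters.
    chars_per_second = target_wpm * chars_per_word / 60
    return min(int(elapsed_seconds * chars_per_second), n_characters)


after_two_seconds = target_position(2, target_wpm=60, n_characters=100)
after_long_time = target_position(100, target_wpm=60, n_characters=20)
```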
-
mltype.interactive.
main_replay
(replay_file, force_perfect, overwrite, instant_death, target_wpm)[source]¶ Run main curses loop with a replay.
- Parameters
force_perfect (bool) – If True, then one cannot finish typing before all characters are typed without any mistakes.
overwrite (bool) – If True, the replay file will be overwritten in case we are faster than it.
replay_file (str or pathlib.Path) – Typed text in this file from some previous game.
instant_death (bool) – If active, the first mistake will end the game.
target_wpm (None or int) – The desired speed to be shown as a guide.
mltype.ml module¶
Machine learning utilities.
-
class
mltype.ml.
LanguageDataset
(X, y, vocabulary, transform=None)[source]¶ Bases:
torch.utils.data.dataset.Dataset
Language dataset.
All the inputs of this class should be generated via create_data_language.
- Parameters
X (np.ndarray) – Array of shape (n_samples, window_size) of dtype np.int8. It represents the features.
y (np.ndarray) – Array of shape (n_samples,) of dtype np.int8. It represents the targets.
vocabulary (list) – List of characters in the vocabulary.
transform (callable or None) – Some callable that inputs X and y and returns some modified instances of them.
-
ohv_matrix
¶ Matrix of shape (vocab_size + 1, vocab_size). The submatrix ohv_matrix[:vocab_size, :] is an identity matrix and is used for fast creation of one hot vectors. The last row of ohv_matrix is a zero vector and is used for encoding out-of-vocabulary characters.
- Type
np.ndarray
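The lookup trick behind ohv_matrix can be shown with plain Python lists. The attribute itself is an np.ndarray, so this is only an illustration of the layout, and `build_ohv_matrix` is a hypothetical helper. Indexing the matrix with a character index yields its one hot vector, and the extra index len(vocabulary) yields the zero vector.

```python
def build_ohv_matrix(vocab_size):
    """Identity matrix with one extra all-zero row for unknown characters."""
    identity = [
        [1 if row == col else 0 for col in range(vocab_size)]
        for row in range(vocab_size)
    ]
    return identity + [[0] * vocab_size]


vocabulary = ["a", "b", "c"]
ohv_matrix = build_ohv_matrix(len(vocabulary))

# A window of character indices; 3 == len(vocabulary) marks an unknown character.
window = [0, 2, 3]
one_hot = [ohv_matrix[ix] for ix in window]
```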
-
class
mltype.ml.
SingleCharacterLSTM
(vocab_size, hidden_size=16, n_layers=1, dense_size=128)[source]¶ Bases:
pytorch_lightning.core.lightning.LightningModule
Single character recurrent neural network.
Given some string of characters, we generate the probability distribution of the next character.
Architecture starts with an LSTM (hidden_size, n_layers, vocab_size) network and then we feed the last hidden state to a fully connected network with one hidden layer (dense_size).
- Parameters
vocab_size (int) – Size of the vocabulary. Necessary since we are encoding each character as a one hot vector.
hidden_size (int) – Hidden size of the recurrent cell.
n_layers (int) – Number of layers in the recurrent network.
dense_size (int) – Size of the single layer of the feed forward network.
-
rnn_layer
¶ The recurrent network layer.
- Type
torch.nn.Module
-
linear_layer1
¶ Linear layer connecting the last hidden state and the single layer of the feedforward network.
- Type
torch.nn.Module
-
linear_layer2
¶ Linear layer connecting the single layer of the feedforward network with the output (of size vocabulary_size).
- Type
torch.nn.Module
-
activation_layer
¶ Softmax layer making sure we get a probability distribution.
- Type
torch.nn.Module
-
configure_optimizers
()[source]¶ Configure optimizers.
Necessary for pytorch-lightning.
- Returns
optimizer – The chosen optimizer.
- Return type
Optimizer
-
forward
(x, h=None, c=None)[source]¶ Perform forward pass.
- Parameters
x (torch.Tensor) – Input features of shape (batch_size, window_size, vocab_size). Note that the provided vocab_size needs to be equal to the one provided in the constructor. The remaining dimensions (batch_size and window_size) can be any positive integers.
h (torch.Tensor) – Hidden states of shape (n_layers, batch_size, hidden_size). Note that if provided we enter a continuation mode. In this case to generate the prediction we just use the last character and the hidden state for the prediction. Note that in this case we enforce that x.shape=(batch_size, 1, vocab_size).
c (torch.Tensor) – Hidden states of shape (n_layers, batch_size, hidden_size). Note that if provided we enter a continuation mode. In this case to generate the prediction we just use the last character and the hidden state for the prediction. Note that in this case we enforce that x.shape=(batch_size, 1, vocab_size).
- Returns
probs (torch.Tensor) – Tensor of shape (batch_size, vocab_size). For each sample it represents the probability distribution over all characters in the vocabulary.
h_n, c_n (torch.Tensor) – New Hidden states of shape (n_layers, batch_size, hidden_size).
-
training
: bool¶
-
training_step
(batch, batch_idx)[source]¶ Run training step.
Necessary for pytorch-lightning.
- Parameters
batch (tuple) – Batch of training samples. The exact definition depends on the dataloader.
batch_idx (idx) – Index of the batch.
- Returns
loss – Tensor scalar representing the mean binary cross entropy over the batch.
- Return type
torch.Tensor
-
validation_epoch_end
(outputs)[source]¶ Run epoch end validation logic.
We sample 5 times 100 characters from the current network. We then print to the standard output.
- Parameters
outputs (list) – List of batches that were collected over the validation set with validation_step.
-
validation_step
(batch, batch_idx)[source]¶ Run validation step.
Optional for pytorch-lightning.
- Parameters
batch (tuple) – Batch of validation samples. The exact definition depends on the dataloader.
batch_idx (idx) – Index of the batch.
- Returns
vocabulary – Vocabulary in order to have access in validation_epoch_end.
- Return type
list
-
mltype.ml.
create_data_language
(text, vocabulary, window_size=2, fill_strategy='zeros', verbose=False)[source]¶ Create a supervised dataset for the character-level language model.
- Parameters
text (str) – Some text.
vocabulary (list) – Unique list of supported characters. Their corresponding indices are going to be used for the one hot encoding.
window_size (int) – The number of previous characters to condition on.
fill_strategy (str, {"skip", "zeros"}) – Strategy for handling initial characters and unknown characters.
verbose (bool) – If True, a progress bar is shown.
- Returns
X (np.ndarray) – Features array of shape (len(text), window_size) if fill_strategy="zeros", otherwise it might be shorter. The dtype is np.int8. If applicable, the integer len(vocabulary) represents a zero vector (out-of-vocabulary token).
y (np.ndarray) – Targets array of shape (len(text),) if fill_strategy="zeros", otherwise it might be shorter. The dtype is np.int8.
indices (np.ndarray) – For each sample, the index of the character we are trying to predict. Note that for fill_strategy="zeros" it is going to be np.arange(len(text)). For other strategies there might be gaps. It helps us keep track of the sample-character correspondence.
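The windowing described above can be sketched in pure Python. Several details are assumptions: positions before the start of the text are padded with the out-of-vocabulary index, unknown targets are also encoded as len(vocabulary), and the "skip" strategy drops every sample that contains that index. Lists are used instead of np.ndarray, and `create_data_sketch` is a hypothetical name.

```python
def create_data_sketch(text, vocabulary, window_size=2, fill_strategy="zeros"):
    """Build (X, y, indices) for next-character prediction."""
    oov = len(vocabulary)  # index standing for the zero vector
    ch2ix = {ch: ix for ix, ch in enumerate(vocabulary)}
    encoded = [ch2ix.get(ch, oov) for ch in text]

    X, y, indices = [], [], []
    for i, target in enumerate(encoded):
        start = i - window_size
        # Left-pad with the OOV index when fewer than window_size characters precede.
        window = [oov] * max(0, -start) + encoded[max(0, start):i]
        if fill_strategy == "skip" and (oov in window or target == oov):
            continue
        X.append(window)
        y.append(target)
        indices.append(i)
    return X, y, indices


vocabulary = ["a", "b"]
X_zeros, y_zeros, ix_zeros = create_data_sketch("abaxaba", vocabulary)
X_skip, y_skip, ix_skip = create_data_sketch(
    "abaxaba", vocabulary, fill_strategy="skip"
)
```

With "zeros" every character yields a sample, while "skip" keeps only the samples at indices 2 and 6, which shows the gaps mentioned for indices.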
-
mltype.ml.
load_model
(path)[source]¶ Load serialized model and vocabulary.
- Parameters
path (pathlib.Path) – Path to where the file lies. This file was created by save_model method.
- Returns
model_inst (SingleCharacterLSTM) – Instance of the model. Note that all of its parameters will be lying on a CPU.
vocabulary (list) – Corresponding vocabulary.
-
mltype.ml.
run_train
(texts, name, max_epochs=10, window_size=50, batch_size=32, vocab_size=None, fill_strategy='skip', illegal_chars='', train_test_split=0.5, hidden_size=32, dense_size=32, n_layers=1, checkpoint_path=None, output_path=None, use_mlflow=True, early_stopping=True, gpus=None)[source]¶ Run the training loop.
Note that the parameters are also explained in the cli of mlt train.
- Parameters
texts (list) – List of str representing all texts we would like to train on.
name (str) – Name of the model. This name is only used when we save the model - it is not hardcoded anywhere in the serialization.
max_epochs (int) – Maximum number of epochs. Note that the number of actual epochs can be lower if we activate the early_stopping flag.
window_size (int) – Number of previous characters to consider when predicting the next character. The higher the number, the longer the memory we are enforcing. However, at the same time, the training becomes slower.
batch_size (int) – Number of samples in one batch.
vocab_size (int) – Maximum number of characters to be put in the vocabulary. Note that one can explicitly exclude characters via illegal_chars. The higher this number, the bigger the feature vectors are and the slower the training.
fill_strategy (str, {"zeros", "skip"}) – Determines how to deal with out of vocabulary characters. When “zeros” then we simply encode them as zero vectors. If “skip”, we skip a given sample if any of the characters in the window or the predicted character are not in the vocabulary.
illegal_chars (str or None) – If specified, then each character of the str represents a forbidden character that we do not put in the vocabulary.
train_test_split (float) – Float in the range (0, 1) representing the percentage of the training set with respect to the entire dataset.
hidden_size (int) – Hidden size of LSTM cells (equal in all layers).
dense_size (int) – Size of the dense layer that is bridging the hidden state outputted by the LSTM and the final output probabilities over the vocabulary.
n_layers (int) – Number of layers inside of the LSTM.
checkpoint_path (None or pathlib.Path or str) – If specified, it is pointing to a checkpoint file (generated by Pytorch-lightning). This file does not contain the vocabulary. It can be used to continue the training.
output_path (None or pathlib.Path or str) – If specified, it is an alternative output folder where the trained models and logging information will be stored. If not specified, the output folder is by default set to ~/.mltype.
use_mlflow (bool) – If active, then we use mlflow for logging of training and validation loss. Additionally, at the end of each epoch we generate a few sample texts to demonstrate how good/bad the current network is.
early_stopping (bool) – If True, then we monitor the validation loss and if it does not improve for a certain number of epochs then we stop the training.
gpus (int or None) – If None or 0, no GPUs are used (only CPUs). Otherwise, it represents the number of GPUs to be used (using the data parallelization strategy).
-
mltype.ml.
sample_char
(network, vocabulary, h=None, c=None, previous_chars=None, random_state=None, top_k=None, device=None)[source]¶ Sample a character given network probability prediction (with a state).
- Parameters
network (torch.nn.Module) – Trained neural network that outputs a probability distribution over vocabulary.
vocabulary (list) – List of unique characters.
h (torch.Tensor) – Hidden states with shape (n_layers, batch_size=1, hidden_size). Note that if both of them are None we are at the very first character.
c (torch.Tensor) – Hidden states with shape (n_layers, batch_size=1, hidden_size). Note that if both of them are None we are at the very first character.
previous_chars (None or str) – Previous characters. None or an empty string if we are at the very first character.
random_state (None or int) – Guarantees reproducibility.
top_k (None or int) – If specified, we only sample from the top k most probable characters. Otherwise all of them.
device (None or torch.device) – By default torch.device(“cpu”).
- Returns
ch – A character from the vocabulary.
- Return type
str
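The roles of top_k and random_state can be illustrated without a network by sampling directly from a given probability vector. This is a sketch under stated assumptions: `sample_from_probs` is a hypothetical helper, and the real function obtains the probabilities from the network and may use a different random number generator.

```python
import random


def sample_from_probs(probs, vocabulary, top_k=None, random_state=None):
    """Sample one character, optionally restricted to the top_k most probable."""
    rng = random.Random(random_state)  # a fixed seed makes the draw reproducible
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]
    weights = [probs[i] for i in order]
    # random.choices treats the weights as relative, so no renormalization is needed.
    chosen = rng.choices(order, weights=weights, k=1)[0]
    return vocabulary[chosen]


vocabulary = ["a", "b", "c"]
probs = [0.1, 0.7, 0.2]

greedy = sample_from_probs(probs, vocabulary, top_k=1, random_state=0)
first = sample_from_probs(probs, vocabulary, random_state=42)
second = sample_from_probs(probs, vocabulary, random_state=42)
```

Setting `top_k=1` reduces sampling to picking the most probable character, and repeating a call with the same `random_state` gives the same character.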
-
mltype.ml.
sample_text
(n_chars, network, vocabulary, initial_text=None, random_state=None, top_k=None, verbose=False, device=None)[source]¶ Sample text by unrolling character by character predictions.
Note that we pass the hidden states along with each character prediction, so there is no need to specify a window.
- Parameters
n_chars (int) – Number of characters to sample.
network (torch.nn.Module) – Pretrained character level network.
vocabulary (list) – List of unique characters.
initial_text (None or str) – If specified, initial text to condition based on.
random_state (None or int) – Allows reproducibility.
top_k (None or int) – If specified, we only sample from the top k most probable characters. Otherwise all of them.
verbose (bool) – Controls verbosity.
device (None or torch.device) – By default torch.device(“cpu”).
- Returns
text – Generated text of length n_chars + len(initial_text).
- Return type
str
-
mltype.ml.
save_model
(model, vocabulary, path)[source]¶ Serialize a model.
Note that we require that the model has a property hparams that we can unpack into the constructor of the class and get the same network architecture. This is automatically the case if we subclass from pl.LightningModule.
- Parameters
model (SingleCharacterLSTM) – Torch model to be saved. Additionally, we require that it has the hparams property that contains all necessary hyperparameters to instantiate the model.
vocabulary (list) – The corresponding vocabulary.
path (pathlib.Path) – Path to the file that will hold the serialized object.
-
mltype.ml.
text2features
(text, vocabulary)[source]¶ Create per character one hot encoding.
Note that we employ the zeros strategy for out-of-vocabulary characters.
- Parameters
text (str) – Text.
vocabulary (list) – Vocabulary to be used for the encoding.
- Returns
res – Array of shape (len(text), len(vocabulary)) of boolean dtype. Each row represents the one hot encoding of the respective character. Note that out-of-vocabulary characters are encoded with a zero vector.
- Return type
np.ndarray
mltype.stats module¶
Computation of various statistics.
mltype.utils module¶
Collection of utility functions.
-
mltype.utils.
get_cache_dir
(predefined_path=None)[source]¶ Get the cache directory path and potentially create it.
If no predefined path is provided, we simply take ~/.mltype. Note that changing os.environ["HOME"] dynamically will influence the output of this function. This is done on purpose to simplify testing.
- Parameters
predefined_path (None or pathlib.Path or str) – If provided, we just return the same path. We potentially create the directory if it does not exist. If it is not provided we use $HOME/.mltype.
- Returns
path – Path to where the caching directory is located.
- Return type
pathlib.Path
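The fallback logic can be sketched with pathlib. `get_cache_dir_sketch` is a hypothetical re-implementation for illustration, and using `Path.home()` to resolve the home directory is an assumption about how the default is obtained.

```python
import tempfile
from pathlib import Path


def get_cache_dir_sketch(predefined_path=None):
    """Return the cache directory, creating it if it does not exist."""
    if predefined_path is not None:
        path = Path(predefined_path)
    else:
        # Assumption: the default lives directly under the user's home directory.
        path = Path.home() / ".mltype"
    path.mkdir(parents=True, exist_ok=True)
    return path


with tempfile.TemporaryDirectory() as tmp:
    cache = get_cache_dir_sketch(Path(tmp) / "custom_cache")
    created = cache.is_dir()
```

The example passes an explicit path inside a temporary directory so that nothing is created in the real home directory.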
-
mltype.utils.
get_mlflow_artifacts_path
(client, run_id)[source]¶ Get path to where the artifacts are located.
The benefit is that we can log any file into it and even create a custom folder hierarchy.
- Parameters
client (mlflow.tracking.MlflowClient) – Client.
run_id (str) – Unique identifier of a run.
- Returns
path – Path to the root folder of artifacts.
- Return type
pathlib.Path
-
mltype.utils.
print_section
(name, fill_char='=', drop_end=False, add_ts=True)[source]¶ Print nice section blocks.
- Parameters
name (str) – Name of the section.
fill_char (str) – Character to be used for filling the line.
drop_end (bool) – If True, the ending line is not printed.
add_ts (bool) – If True, we add a timestamp to the heading.
Module contents¶
Python package.