Module hdlib.parser

Utility to parse input files.

This module provides utilities to parse input tables and to split a dataset into training and test sets, using either a simple percentage split or cross-validation.

Functions

def load_dataset(filepath: str, sep: str = '\t') ‑> Tuple[List[str], List[str], List[List[float]], List[str]]

Load the input numerical dataset.

Parameters

filepath : str
Path to the input dataset.
sep : str
Field separator for the input dataset.

Returns

tuple
A tuple with a list of sample IDs, a list of features, a list of lists with the actual numerical data (floats), and a list with class labels.

Raises

FileNotFoundError
If the input file does not exist.
ValueError
If the input dataset contains non-numerical values.
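
The four-element return value and the two exceptions described above can be illustrated with a minimal stand-alone sketch. This is not hdlib's implementation: the helper name `sketch_load_dataset` is hypothetical, and the table layout (a header row, sample IDs in the first column, class labels in the last column, numerical features in between) is an assumption made only for illustration.

```python
import csv
import os
import tempfile
from typing import List, Tuple


def sketch_load_dataset(
    filepath: str, sep: str = "\t"
) -> Tuple[List[str], List[str], List[List[float]], List[str]]:
    """Hypothetical loader; the table layout is assumed, not taken from hdlib."""
    if not os.path.isfile(filepath):
        raise FileNotFoundError(f"Input file not found: {filepath}")

    samples: List[str] = []
    content: List[List[float]] = []
    classes: List[str] = []

    with open(filepath, newline="") as handle:
        reader = csv.reader(handle, delimiter=sep)
        header = next(reader, None)
        if header is None:
            raise ValueError("The input dataset is empty")
        # Assumed layout: <sample ID> <feature 1> ... <feature n> <class label>
        features = header[1:-1]
        for row in reader:
            if not row:
                continue  # skip blank lines
            samples.append(row[0])
            # float() raises ValueError when a cell is not numerical
            content.append([float(value) for value in row[1:-1]])
            classes.append(row[-1])

    return samples, features, content, classes


# Write a tiny tab-separated table to a temporary file and load it back
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "toy.tsv")
    with open(path, "w") as handle:
        handle.write("id\tf1\tf2\tclass\ns1\t0.5\t1.0\tA\ns2\t2.0\t3.5\tB\n")
    samples, features, content, classes = sketch_load_dataset(path)
```

Under the assumed layout, `samples` and `classes` each hold one entry per row, `features` holds one entry per numerical column, and `content` holds one list of floats per row.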

def percentage_split(labels: List[str], percentage: float, seed: int = 0) ‑> List[int]

Given the list of class labels as they appear in the original dataset and a percentage, split the dataset and report the indices of the selected data points.

Parameters

labels : list
List of class labels as they appear in the original dataset.
percentage : float
Percentage of data points to select from the original dataset. Must be greater than 0.0 and at most 100.0.
seed : int
Random seed for reproducing the same results.

Returns

list
A list with the indices of selected points.

Raises

ValueError
  • if the input percentage is lower than or equal to 0.0 or greater than 100.0;
  • if the input seed is not an integer number.

Examples

>>> from hdlib.parser import percentage_split
>>> labels = [1, 2, 2, 2, 1, 1, 1, 1, 2, 2]
>>> percentage_split(labels, 20.0, seed=0)
[6, 9]

The example takes a dataset with 10 data points, selects 20% of them (2 points in this case), and reports their indices in the original dataset.
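
The selected indices are typically used as the test set, with the remaining indices forming the training set. The stand-alone sketch below shows one way such a split could work. It is not hdlib's implementation: `sketch_percentage_split` is a hypothetical helper, and drawing the percentage separately within each class is an assumption (suggested by the function taking class labels, but not stated above). It is not expected to reproduce the exact indices returned by `percentage_split`.

```python
import random
from typing import Dict, Hashable, List, Sequence


def sketch_percentage_split(
    labels: Sequence[Hashable], percentage: float, seed: int = 0
) -> List[int]:
    """Hypothetical per-class percentage split; not hdlib's algorithm."""
    # Same validation rules as listed under "Raises"
    if percentage <= 0.0 or percentage > 100.0:
        raise ValueError("percentage must be greater than 0.0 and at most 100.0")
    if not isinstance(seed, int):
        raise ValueError("seed must be an integer")

    rng = random.Random(seed)  # local generator, so the global state is untouched

    # Group the data point indices by class label
    by_class: Dict[Hashable, List[int]] = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    # Assumption: sample the requested percentage within each class
    selected: List[int] = []
    for indices in by_class.values():
        count = int(len(indices) * percentage / 100.0)
        selected.extend(rng.sample(indices, count))

    return sorted(selected)


labels = [1, 2, 2, 2, 1, 1, 1, 1, 2, 2]
test_indices = sketch_percentage_split(labels, 20.0, seed=0)

# Every index that was not selected goes to the training set
test_set = set(test_indices)
train_indices = [i for i in range(len(labels)) if i not in test_set]
```

With 5 points per class and a percentage of 20.0, the sketch selects one index from each class, leaving 8 indices for training. Calling it again with the same seed returns the same indices.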