Module hdlib.parser
Utility to parse input files.
This module provides a set of utilities to parse input tables and to split a dataset into training and test sets, using either a simple percentage split or cross-validation.
Functions
def load_dataset(filepath: , sep: str = '\t') ‑> Tuple[List[str], List[List[float]], List[str]]
- Load the input numerical dataset.
Parameters
filepath
:str
- Path to the input dataset.
sep
:str
- Field separator for the input dataset.
Returns
tuple
- A tuple with a list of sample IDs, a list of features, a list of lists with the actual numerical data (floats), and a list with class labels.
Raises
FileNotFoundError
- If the input file does not exist.
ValueError
- If the input dataset contains non-numerical values.
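The behaviour described above can be illustrated with a minimal, self-contained sketch. It is not the hdlib implementation: load_table_sketch is a hypothetical stand-in, and the file layout (a header row, sample IDs in the first column, class labels in the last column, numerical features in between) is an assumption made only for this example.

```python
import csv
import os
import tempfile
from typing import List, Tuple


def load_table_sketch(
    filepath: str, sep: str = "\t"
) -> Tuple[List[str], List[str], List[List[float]], List[str]]:
    """Hypothetical stand-in for load_dataset; the file layout is assumed."""
    if not os.path.isfile(filepath):
        raise FileNotFoundError(filepath)

    samples: List[str] = []
    content: List[List[float]] = []
    classes: List[str] = []

    with open(filepath, newline="") as handle:
        reader = csv.reader(handle, delimiter=sep)
        header = next(reader)
        # Assumed layout: ID column, feature columns, class column.
        features = header[1:-1]
        for row in reader:
            if not row:
                continue  # skip blank lines
            samples.append(row[0])
            try:
                content.append([float(value) for value in row[1:-1]])
            except ValueError as error:
                raise ValueError("The dataset must contain numbers only") from error
            classes.append(row[-1])

    return samples, features, content, classes


# Usage: write a tiny tab-separated table and load it back.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "toy.tsv")
    with open(path, "w") as handle:
        handle.write("id\tf1\tf2\tclass\n")
        handle.write("s1\t0.5\t1.0\tA\n")
        handle.write("s2\t2.0\t3.5\tB\n")
    samples, features, content, classes = load_table_sketch(path)
```

The sketch returns the four lists in the order given in the Returns description: sample IDs, features, numerical data, and class labels.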
def percentage_split(labels: List[str], percentage: float, seed: int = 0) ‑> List[int]
- Given the list of class labels as they appear in the original dataset and a percentage, split the dataset and report the indices of the selected data points.
Parameters
labels
:list
- List of class labels as they appear in the original dataset.
percentage
:float
- Percentage of points to split out of the original dataset.
seed
:int
- Random seed for reproducing the same results.
Returns
list
- A list with the indices of selected points.
Raises
ValueError
- If the input percentage is lower than or equal to 0.0 or greater than 100.0.
- If the input seed is not an integer number.
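The two error conditions can be sketched as a small argument check. The helper name check_split_arguments is hypothetical and is not part of hdlib; it only mirrors the documented conditions under which percentage_split raises ValueError.

```python
def check_split_arguments(percentage: float, seed: int = 0) -> None:
    """Hypothetical helper mirroring the documented ValueError conditions."""
    # The percentage must lie in the interval (0.0, 100.0].
    if percentage <= 0.0 or percentage > 100.0:
        raise ValueError("percentage must be > 0.0 and <= 100.0")
    # The seed must be an integer number.
    if not isinstance(seed, int):
        raise ValueError("seed must be an integer number")


# Collect the argument pairs that the check rejects.
rejected = []
for percentage, seed in [(0.0, 0), (100.5, 0), (20.0, 1.5)]:
    try:
        check_split_arguments(percentage, seed)
    except ValueError:
        rejected.append((percentage, seed))
```

Note that 100.0 itself is accepted, since only values greater than 100.0 are rejected.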
Examples
>>> from hdlib.parser import percentage_split
>>> labels = [1, 2, 2, 2, 1, 1, 1, 1, 2, 2]
>>> percentage_split(labels, 20.0, seed=0)
[6, 9]
Consider a dataset with 10 data points, select 20% of the points (2 points in this case), and report their indices in the original dataset.
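As a rough illustration of how the returned indices might be used, the sketch below partitions the labels into two subsets. It does not call hdlib: the indices [6, 9] are copied from the example above, and treating the selected points as the test set is an assumption made for this sketch.

```python
labels = [1, 2, 2, 2, 1, 1, 1, 1, 2, 2]

# Indices reported in the example above; assumed here to form the test set.
test_indices = [6, 9]

# Every index that was not selected goes to the training set.
selected = set(test_indices)
train_indices = [i for i in range(len(labels)) if i not in selected]

train_labels = [labels[i] for i in train_indices]
test_labels = [labels[i] for i in test_indices]
```

The same index lists can be applied to the sample IDs and to the numerical data so that all three stay aligned.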