For a machine learning task, I need to work with sequences of different lengths. To process these sequences efficiently, I process them in batches of size size_batch. A batch typically has 4 dimensions, and I want to convert it to a NumPy ndarray with 4 dimensions. For each sequence, I need to pad with some defined pad_value so that every element has the same size: the maximal size.
For example with 3 dimensional input:
[[[0, 1, 2], [3], [4, 5]], [[6]], [[7, 8], [9]]]
the desired output for pad_value -1 is:
[[[0, 1, 2], [3, -1, -1], [4, 5, -1]],
 [[6, -1, -1], [-1, -1, -1], [-1, -1, -1]],
 [[7, 8, -1], [9, -1, -1], [-1, -1, -1]]]
which has shape (3, 3, 3). For this problem, one can suppose there are no empty lists in the input. Here is the solution I came up with:
```python
import itertools as it
from typing import List

import numpy as np


def pad(array: List, pad_value: np.int32, dtype: type = np.int32) -> np.ndarray:
    """
    Pads a nested list to the max shape and fills empty values with pad_value.

    :param array: high-dimensional list to be padded
    :param pad_value: value used to fill missing entries
    :param dtype: type of the output
    :return: padded copy of param array
    """
    # Get max shape
    def get_max_shape(arr, ax=0, dims=None):
        if dims is None:
            dims = []
        try:
            if ax >= len(dims):
                dims.append(len(arr))
            else:
                dims[ax] = max(dims[ax], len(arr))
            for i in arr:
                get_max_shape(i, ax + 1, dims)
        except TypeError:  # non-iterable / length-less objects (leaves)
            pass
        return dims

    dims = get_max_shape(array)

    # Pad values
    def get_item(arr, idx):
        while True:
            i, *idx = idx
            arr = arr[i]
            if not idx:
                break
        return arr

    r = np.zeros(dims, dtype=dtype) + pad_value
    for idx in it.product(*map(range, dims)):
        # idx runs through all index tuples that might
        # contain a value in array
        try:
            r[idx] = get_item(array, idx)
        except IndexError:
            continue
    return r
```
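For comparison, here is a recursive sketch of the same idea that I could imagine (pad_nested, max_shape, and fill are names I made up): instead of iterating over every possible index tuple and catching IndexError, it pre-fills the output with np.full and then only visits indices that actually exist in the input.

```python
import numpy as np
from typing import List


def pad_nested(array: List, pad_value, dtype=np.int32) -> np.ndarray:
    # Determine the maximal length at each nesting depth.
    def max_shape(arr, dims, depth=0):
        if isinstance(arr, list):
            if depth == len(dims):
                dims.append(len(arr))
            else:
                dims[depth] = max(dims[depth], len(arr))
            for item in arr:
                max_shape(item, dims, depth + 1)
        return dims

    # Recursively copy values into the pre-filled output,
    # touching only positions present in the input.
    def fill(arr, out):
        for i, item in enumerate(arr):
            if isinstance(item, list):
                fill(item, out[i])
            else:
                out[i] = item

    dims = max_shape(array, [])
    result = np.full(dims, pad_value, dtype=dtype)
    fill(array, result)
    return result
```

This avoids the full Cartesian product over dims, so it should scale with the number of actual elements rather than the padded size, though I have not benchmarked it.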
It does not feel very Pythonic, but it does the job. Is there a better way to do it that I should know about? I think I might be able to improve its speed with smart breaks in the last loop, but I haven't dug into it much yet.