datapad.Sequence

class datapad.Sequence(_iterable=None)

The core object in datapad used to wrap sequence-like data types in a fluent-style API.

__init__(_iterable=None)

Instantiates a new Sequence object.

Parameters:_iterable (List, Set, Tuple, Iterator) – Any object that conforms to the Iterable API.

Methods

__init__([_iterable]) Instantiates a new Sequence object.
all() Returns a standard python iterator that you can use to lazily iterate over your sequence of data
batch(size) Lazily combines elements in sequence into a list of length size.
cache([overwrite]) Greedily stores results of self.collect() in an internal variable that can later be used to reset the Sequence to the beginning of the iterator.
collect() Eagerly returns all elements in sequence
concat(seq) Concatenates another sequence to the end of this sequence
count([distinct]) Eagerly count number of elements in sequence
distinct() Eagerly returns a new sequence with unique values
drop(count) Lazily skips (drops) the first count elements.
drop_if(fn) Lazily apply fn function to every element of iterable and drop sequence elements where the function fn evaluates to True.
filter(fn) This is an alias for the Sequence.keep_if function
first() Eagerly returns first element in sequence
flatmap(fn) Lazily apply fn function to every element of iterable and chain the output into a single flattened sequence.
groupby([key, getter, eager_group]) Groups sequence using key function.
join(other[, key, other_key]) Joins two sequences based on common field matches between the sequence and other.
keep_if(fn) Lazily apply fn function to every element of iterable and keep only sequence elements where the function fn evaluates to True.
map(fn) Lazily apply fn function to every element of iterable
next() Eagerly returns next element in sequence (alias for first() function)
peek([count]) Returns list of count elements without advancing sequence iterator.
pmap(fn[, workers, ordered]) Lazily apply fn function to every element of iterable, in parallel using multiprocess.dummy.Pool.
reduce(fn[, initial]) Eagerly apply a function of two arguments cumulatively to the items of a sequence, from left to right, so as to reduce the sequence to a single value.
reset() Uses the internal cache of the Sequence to reset to beginning of iterator
shuffle() Eagerly shuffles your sequence and returns a newly created sequence containing the shuffled items.
sort([key]) Eagerly sorts your sequence and returns a newly created sequence containing the sorted items.
take(count) Lazily returns a sequence of the first count elements.
window(size[, stride]) Lazily slides and yields a window of length size over sequence.
zip_with_index() Adds an index to each item in sequence
all()

Returns a standard python iterator that you can use to lazily iterate over your sequence of data

>>> seq = Sequence(range(10))
>>> seq = seq.map(lambda v: v*2)
>>> i = 0
>>> for item in seq.all():
...     i += item
>>> i
90
batch(size)

Lazily combines elements in sequence into a list of length size. This function will drop any remainder if the sequence ends before a batch with size has been created.

Parameters:size (int) – The batch size.

Examples

>>> seq = Sequence(range(10))
>>> seq.batch(3).collect()
[[0, 1, 2], [3, 4, 5], [6, 7, 8]]
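
The remainder-dropping behaviour described above can be sketched with the standard library alone. This is only an illustration of the semantics, not datapad's implementation; the helper name `batch_drop_remainder` is hypothetical.

```python
from itertools import islice

def batch_drop_remainder(iterable, size):
    """Yield lists of exactly `size` items; a short final batch is discarded."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if len(chunk) < size:
            return  # incomplete (or empty) final batch is dropped
        yield chunk

batches = list(batch_drop_remainder(range(10), 3))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
```

Note that the trailing element 9 never appears in the output because it cannot fill a batch of three.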
cache(overwrite=False)

Greedily stores results of self.collect() in an internal variable that can later be used to reset the Sequence to the beginning of the iterator. This is useful if you want to make multiple passes over the data. This function is meant to be used in conjunction with reset().

Parameters:overwrite (bool) – By default, multiple calls to the cache function will only cache the initial state of the iterator. If you set overwrite to True, this will take the current state of the iterator and store its results in the internal cache.
>>> seq = Sequence([1,2,3])
>>> _ = seq.cache()
>>> seq.collect()
[1, 2, 3]
>>> _ = seq.reset()
>>> seq.collect()
[1, 2, 3]
>>> _ = seq.reset()
>>> seq.next()
1
>>> _ = seq.cache(overwrite=True)
>>> seq.collect()
[2, 3]
>>> _ = seq.reset()
>>> seq.collect()
[2, 3]
collect()

Eagerly returns all elements in sequence

>>> seq = Sequence(range(10))
>>> seq.collect()
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> seq.collect()
[]
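
The empty second result above follows from ordinary one-shot iterator behaviour in Python. A minimal standard-library sketch of the same effect, assuming the Sequence wraps a plain iterator (no datapad calls are used here):

```python
# A plain Python iterator can only be consumed once.
numbers = iter(range(10))

first_pass = list(numbers)   # consumes every element
second_pass = list(numbers)  # nothing left to consume

print(first_pass)   # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(second_pass)  # []
```

To make a second pass over the same data, see cache() and reset().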
concat(seq)

Concatenates another sequence to the end of this sequence

Examples

Concat two sequences together:

>>> s1 = Sequence(['a', 'b', 'c'])
>>> s2 = Sequence(range(3))
>>> s3 = s2.concat(s1)
>>> s3.collect()
[0, 1, 2, 'a', 'b', 'c']

Concat sequence with itself:

>>> seq = Sequence(range(5))
>>> seq = seq.concat(seq)
>>> seq.collect()
[0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
count(distinct=False)

Eagerly count number of elements in sequence

Parameters:distinct – bool If True, counts occurrences of each distinct value in sequence.
Returns:Either an integer count or a new sequence of tuples where the first value is the unique element and the second value is the number of times that element appeared in the sequence.
>>> seq = Sequence(range(5))
>>> seq.count()
5
>>> seq = Sequence(['a', 'a', 'b', 'b', 'c', 'c'])
>>> seq.count(distinct=True).collect()
[('a', 2), ('b', 2), ('c', 2)]
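
For comparison, the same (value, count) pairs can be produced with `collections.Counter` from the standard library. Whether datapad uses `Counter` internally is not stated here, so treat this as an equivalent sketch only.

```python
from collections import Counter

values = ['a', 'a', 'b', 'b', 'c', 'c']

# Counter keeps keys in first-seen order (Python 3.7+).
pairs = list(Counter(values).items())
print(pairs)  # [('a', 2), ('b', 2), ('c', 2)]
```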
distinct()

Eagerly returns a new sequence with unique values

>>> seq = Sequence(['a', 'a', 'b', 'b', 'c', 'c'])
>>> seq.distinct().collect()
['a', 'b', 'c']
drop(count)

Lazily skips (drops) the first count elements.

>>> seq = Sequence(range(5))
>>> seq.collect()
[0, 1, 2, 3, 4]
>>> seq = Sequence(range(5))
>>> seq = seq.drop(2)
>>> seq.collect()
[2, 3, 4]
drop_if(fn)

Lazily apply fn function to every element of iterable and drop sequence elements where the function fn evaluates to True.

Parameters:fn – function Function with signature fn(element) -> bool to apply to every element of sequence. Drop all elements in the sequence where the fn function evaluates to True.
>>> seq = Sequence(range(5))
>>> seq = seq.drop_if(lambda v: v > 1)
>>> seq.collect()
[0, 1]
filter(fn)

This is an alias for the Sequence.keep_if function

>>> seq = Sequence(range(5))
>>> seq = seq.filter(lambda v: v > 1)
>>> seq.collect()
[2, 3, 4]
first()

Eagerly returns first element in sequence

Examples

Get first value in sequence:

>>> seq = Sequence(range(5))
>>> seq.first()
0
>>> seq.first()
1

Calling first on empty sequence returns None:

>>> seq = Sequence([])
>>> seq.first()
flatmap(fn)

Lazily apply fn function to every element of iterable and chain the output into a single flattened sequence.

Parameters:fn (function) – Function with signature fn(element) -> iterable(element) to apply to every element of sequence.

Examples

>>> seq = Sequence(range(5))
>>> seq = seq.flatmap(lambda v: [v,v])
>>> seq.collect()
[0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
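
The map-then-flatten idea can be sketched with `itertools.chain.from_iterable`. The helper name `flatmap_sketch` is hypothetical and this is not necessarily how datapad implements it.

```python
from itertools import chain

def flatmap_sketch(fn, iterable):
    """Lazily apply fn to each element and chain the resulting iterables."""
    return chain.from_iterable(map(fn, iterable))

flattened = list(flatmap_sketch(lambda v: [v, v], range(5)))
print(flattened)  # [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
```

Both `map` and `chain.from_iterable` are lazy, so no work happens until the result is iterated.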
groupby(key=None, getter=None, eager_group=True)

Groups sequence using key function.

Note: you must ensure elements are sorted by groups before calling this function.

Parameters:
  • key – function Function used to determine what to use as a key for grouping
  • getter – function Function to be applied to each element of a group
  • eager_group – bool, default=True If true, eagerly convert a group from a lazy Sequence to a fully-realized list.

Examples

Simple usage:

>>> from pprint import pprint
>>> seq = Sequence(['a', 'b', 'c', 'd', 'a', 'b', 'a', 'd'])
>>> res = seq.sort().groupby(key=lambda x: x).collect()
>>> res == [
...    ('a', ['a', 'a', 'a']),
...    ('b', ['b', 'b']),
...    ('c', ['c']),
...    ('d', ['d', 'd']),
... ]
True

Grouping with getter function:

>>> things = [("animal", "lion"),
...           ("plant", "maple tree"),
...           ("animal", "walrus"),
...           ("plant", "grass")]
>>> seq = Sequence(things)
>>> res = seq.sort().groupby(key=lambda x: x[0], getter=lambda x: x[1]).collect()
>>> res == [
...    ('animal', ['lion', 'walrus']),
...    ('plant', ['grass', 'maple tree'])
... ]
True
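
The sorting requirement noted above matters because grouping of this kind typically merges only consecutive elements with equal keys. The sketch below shows the effect with the standard library's `itertools.groupby`; it assumes, without confirmation from this page, that datapad's groupby behaves the same way on unsorted input.

```python
from itertools import groupby

letters = ['a', 'b', 'a', 'b']

# Without sorting, equal keys that are not adjacent form separate groups.
unsorted_groups = [(k, list(g)) for k, g in groupby(letters)]
print(unsorted_groups)  # [('a', ['a']), ('b', ['b']), ('a', ['a']), ('b', ['b'])]

# Sorting first makes equal keys adjacent, so each key yields one group.
sorted_groups = [(k, list(g)) for k, g in groupby(sorted(letters))]
print(sorted_groups)  # [('a', ['a', 'a']), ('b', ['b', 'b'])]
```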
join(other, key=None, other_key=None)

Joins two sequences based on common field matches between the sequence and other. This is known as an “inner” join in SQL terminology.

Parameters:
  • other (Sequence) – A Sequence to join with the calling sequence.
  • key (function) – A function to retrieve the field to be used for matching between the two sequences. If key is None, then key will default to lambda x: x.
  • other_key (function) – A function to retrieve the field in other to be used for matching between the two sequences. If other_key is None, use key.
Returns:

A sequence of 2-tuples (a, b) where a is an element in self that matched element b in other (based on the given field keys).

Examples

>>> a = Sequence([
...     {'id': 1, 'name': 'John'},
...     {'id': 2, 'name': 'Nayeon'},
...     {'id': 3, 'name': 'Reza'}
... ])
>>> b = Sequence([
...     {'id': 1, 'age': 2},
...     {'id': 2, 'age': 3}
... ])
>>> res = a.join(b, key=lambda x: x['id']).collect()
>>> res == [
...     ({'id': 1, 'name': 'John'}, {'id': 1, 'age': 2}),
...     ({'id': 2, 'name': 'Nayeon'}, {'id': 2, 'age': 3})
... ]
True
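
For readers unfamiliar with inner joins, the matching rule can be sketched in plain Python. The function `inner_join_sketch` and the `person_id` field are hypothetical and only illustrate the described semantics (including the key and other_key defaults); datapad's own implementation may differ.

```python
from collections import defaultdict

def inner_join_sketch(left, right, key=None, other_key=None):
    """Yield (a, b) pairs where key(a) == other_key(b)."""
    key = key or (lambda x: x)    # default: match on the element itself
    other_key = other_key or key  # default: reuse key for the other side
    index = defaultdict(list)
    for b in right:
        index[other_key(b)].append(b)
    for a in left:
        for b in index.get(key(a), []):
            yield (a, b)

people = [{'id': 1, 'name': 'John'}, {'id': 3, 'name': 'Reza'}]
ages = [{'person_id': 1, 'age': 2}, {'person_id': 2, 'age': 3}]

pairs = list(inner_join_sketch(
    people, ages,
    key=lambda x: x['id'],
    other_key=lambda x: x['person_id'],
))
print(pairs)  # [({'id': 1, 'name': 'John'}, {'person_id': 1, 'age': 2})]
```

Elements without a match on the other side (here Reza and the record with person_id 2) are omitted, which is what makes the join "inner".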
keep_if(fn)

Lazily apply fn function to every element of iterable and keep only sequence elements where the function fn evaluates to True.

Parameters:fn – function Function with signature fn(element) -> bool to apply to every element of sequence. Keep all elements in the sequence where the fn function evaluates to True.
>>> seq = Sequence(range(5))
>>> seq = seq.keep_if(lambda v: v > 1)
>>> seq.collect()
[2, 3, 4]
map(fn)

Lazily apply fn function to every element of iterable

Parameters:fn (function) – Function with signature fn(element) to apply to every element of sequence.
>>> seq = Sequence(range(10))
>>> seq = seq.map(lambda v: v*2)
>>> seq.collect()
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
next()

Eagerly returns next element in sequence (alias for first() function)

Examples:

Get next value in sequence:

>>> seq = Sequence(range(5))
>>> seq.next()
0
>>> seq.next()
1

Calling next on empty sequence returns None:

>>> seq = Sequence([])
>>> seq.next()
peek(count=None)

Returns list of count elements without advancing sequence iterator. If count is None, return only the first element.

WARNING: this function will load up to count elements of your sequence into memory.

Examples

Peek at first element (notice iterator does not advance):

>>> seq = Sequence(range(10))
>>> seq.peek()
0
>>> seq.peek()
0

Peek at first 3 elements:

>>> seq = Sequence(range(10))
>>> seq.peek(3)
[0, 1, 2]
>>> seq.peek(3)
[0, 1, 2]
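
One common way to look ahead in a one-shot iterator is to read the leading items into a list and then chain them back in front of the remainder. The helper `peek_head` below is a hypothetical standard-library sketch of that idea, not datapad's code; it also shows why up to count elements end up in memory.

```python
from itertools import chain, islice

def peek_head(iterator, count):
    """Return (head, restored): the first `count` items and an iterator
    that still yields every item, including the ones in head."""
    head = list(islice(iterator, count))  # these items are held in memory
    return head, chain(head, iterator)

it = iter(range(10))
head, it = peek_head(it, 3)
print(head)      # [0, 1, 2]

rest = list(it)
print(rest)      # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```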
pmap(fn, workers=3, ordered=True)

Lazily apply fn function to every element of iterable, in parallel using multiprocess.dummy.Pool. The returned sequence may appear in a different order than the input sequence if you set ordered to False.

THIS FUNCTION IS EXPERIMENTAL

Parameters:
  • fn (function) – Function with signature fn(element) -> element to apply to every element of sequence.
  • workers (int) – Number of parallel workers to use (default: 3). These workers are implemented as python threads.
  • ordered (bool) – Whether to yield results in the same order in which items arrive. You may get better performance by setting this to false (default: True).
>>> seq = Sequence(range(10))
>>> seq = seq.pmap(lambda v: v*2)
>>> seq.collect()
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
>>> seq = Sequence(range(10))
>>> seq = seq.pmap(lambda v: v*2, workers=1, ordered=False)
>>> seq.collect()
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
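
The difference between ordered and unordered results can be illustrated with the standard library's thread-backed `multiprocessing.dummy.Pool`. This sketch assumes the pool named above corresponds to that class and that ordered maps onto `imap` versus `imap_unordered`; neither is confirmed by this page.

```python
from multiprocessing.dummy import Pool  # thread-based Pool from the stdlib

def double(v):
    return v * 2

with Pool(3) as pool:
    # imap yields results in input order.
    ordered = list(pool.imap(double, range(10)))
    # imap_unordered yields results as workers finish, in any order.
    unordered = list(pool.imap_unordered(double, range(10)))

print(ordered)            # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
print(sorted(unordered))  # same values once sorted
```

Because the workers are threads, CPU-bound functions are unlikely to speed up under CPython's global interpreter lock; I/O-bound functions are the usual beneficiaries.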
reduce(fn, initial=None)

Eagerly apply a function of two arguments cumulatively to the items of a sequence, from left to right, so as to reduce the sequence to a single value. For example, reduce(lambda x, y: x+y, [1, 2, 3, 4, 5]) calculates ((((1+2)+3)+4)+5). If initial is present, it is placed before the items of the sequence in the calculation, and serves as a default when the sequence is empty.

Parameters:
  • fn (function) – Function with signature fn(acc, current_item) -> acc_next
  • initial (Any) – An initial value that acc will be set to. If not provided, this function will set the first element of the sequence as the initial value.

Examples

Reduce with accumulator initialized to first element:

>>> seq = Sequence(range(3))
>>> seq.reduce(lambda acc, item: acc + item)
3

Reduce with accumulator set to a custom initial value:

>>> seq = Sequence(range(3))
>>> seq.reduce(lambda acc, item: acc + item, initial=10)
13
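
The description above mirrors the standard library's `functools.reduce`, so the role of initial, including its use as a default for an empty sequence, can be checked there. The snippet uses only `functools` and makes no claim about how datapad handles an empty sequence without initial.

```python
from functools import reduce

def add(acc, item):
    return acc + item

total = reduce(add, range(3))                   # ((0 + 1) + 2)
total_with_initial = reduce(add, range(3), 10)  # (((10 + 0) + 1) + 2)
empty_default = reduce(add, [], 10)             # initial is returned as-is

print(total, total_with_initial, empty_default)  # 3 13 10
```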
reset()

Uses the internal cache of the Sequence to reset to beginning of iterator

>>> seq = Sequence([1, 2, 3])
>>> _ = seq.cache()
>>> seq.collect()
[1, 2, 3]
>>> _  = seq.reset()
>>> seq.collect()
[1, 2, 3]
shuffle()

Eagerly shuffles your sequence and returns a newly created sequence containing the shuffled items. WARNING: this function loads the entirety of your sequence into memory.

>>> import random
>>> random.seed(0)
>>> seq = Sequence(range(5))
>>> seq.shuffle().collect()
[2, 1, 0, 4, 3]
sort(key=None)

Eagerly sorts your sequence and returns a newly created sequence containing the sorted items. WARNING: this function loads the entirety of your sequence into memory.

>>> seq = Sequence([2, 1, 0, 4, 3])
>>> seq.sort().collect()
[0, 1, 2, 3, 4]
take(count)

Lazily returns a sequence of the first count elements.

>>> seq = Sequence(range(5))
>>> seq.take(2).collect()
[0, 1]
window(size, stride=1)

Lazily slides and yields a window of length size over sequence. This function will drop any remainder if the sequence ends before a window with size has been filled.

Parameters:
  • size (int) – The window size.
  • stride (int) – How many elements to skip for each advancement in window position.

Examples

>>> seq = Sequence(range(10))
>>> seq.window(2).collect()
[[0, 1], [1, 2], [2, 3], [3, 4], [4, 5], [5, 6], [6, 7], [7, 8], [8, 9]]
>>> seq = Sequence(range(10))
>>> seq.window(3, stride=2).collect()
[[0, 1, 2], [2, 3, 4], [4, 5, 6], [6, 7, 8]]
>>> seq = Sequence(range(10))
>>> seq.window(2, stride=4).collect()
[[0, 1], [4, 5], [8, 9]]
>>> seq = Sequence(range(10))
>>> seq.window(1, stride=1).collect()
[[0], [1], [2], [3], [4], [5], [6], [7], [8], [9]]
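
How size and stride interact, including the case where stride is larger than size, can be sketched with a bounded deque. The helper `sliding_window` is hypothetical and only reproduces the behaviour shown in the examples above; it is not taken from datapad.

```python
from collections import deque
from itertools import islice

def sliding_window(iterable, size, stride=1):
    """Lazily yield full windows of `size` items, advancing by `stride`."""
    if size < 1 or stride < 1:
        raise ValueError("size and stride must be at least 1")
    it = iter(iterable)
    buf = deque(islice(it, size), maxlen=size)
    if len(buf) < size:
        return  # not enough items for even one window
    yield list(buf)
    while True:
        fresh = list(islice(it, stride))
        if len(fresh) < stride:
            return  # the next window cannot be filled, so it is dropped
        buf.extend(fresh)  # maxlen keeps only the newest `size` items
        yield list(buf)

overlapping = list(sliding_window(range(10), 3, stride=2))
skipping = list(sliding_window(range(10), 2, stride=4))
print(overlapping)  # [[0, 1, 2], [2, 3, 4], [4, 5, 6], [6, 7, 8]]
print(skipping)     # [[0, 1], [4, 5], [8, 9]]
```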
zip_with_index()

Adds an index to each item in sequence

>>> seq = Sequence(['a', 'b', 'c'])
>>> seq.zip_with_index().collect()
[(0, 'a'), (1, 'b'), (2, 'c')]
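
The (index, item) pairs shown above have the same shape as those produced by Python's built-in `enumerate`. The following is an equivalent standard-library sketch, not a statement about datapad's internals.

```python
items = ['a', 'b', 'c']

# enumerate lazily pairs each item with a zero-based index.
indexed = list(enumerate(items))
print(indexed)  # [(0, 'a'), (1, 'b'), (2, 'c')]
```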