datapad.Sequence

class datapad.Sequence(iterable: Iterable | None = None)

Bases: object

The core object in datapad used to wrap sequence-like data types in a fluent-style API.

Methods

__init__

Instantiates a new Sequence object.

all

Returns a standard python iterator that you can use to lazily iterate over your sequence of data

batch

Lazily combines elements in sequence into a list of length size.

cache

Greedily stores results of self.collect() in an internal variable that can later be used to reset the Sequence to the beginning of the iterator.

collect

Eagerly returns all elements in sequence

concat

Concatenates another sequence to the end of this sequence

count

Eagerly count number of elements in sequence

distinct

Eagerly returns a new sequence with unique values

drop

Lazily skip or drop over count elements.

drop_if

Lazily apply fn function to every element of iterable and drop sequence elements where the function fn evaluates to True.

dump

Dump sequence into a sink function that will greedily consume elements.

filter

This is an alias for the Sequence.keep_if function

first

Eagerly returns first element in sequence

flatmap

Lazily apply fn function to every element of iterable and chain the output into a single flattened sequence.

groupby

Groups sequence using key function.

join

Joins two sequences based on common field matches between the sequence and other.

keep_if

Lazily apply fn function to every element of iterable and keep only sequence elements where the function fn evaluates to True.

map

Lazily apply fn function to every element of iterable

next

Eagerly returns next element in sequence (alias for first() function)

peek

Returns list of count elements without advancing sequence iterator.

pipe

Pass sequence to a function that will iterate over each element in the sequence and return another sequence.

pmap

Lazily apply fn function to every element of iterable, in parallel using python's multiprocessing package.

progress

Output progress and transparently pass through elements of sequence.

reduce

Eagerly apply a function of two arguments cumulatively to the items of a sequence, from left to right, so as to reduce the sequence to a single value.

reset

Uses the internal cache of the Sequence to reset to beginning of iterator

shuffle

Eagerly shuffles your sequence and returns a newly created sequence containing the shuffled items.

sort

Eagerly sorts your sequence and returns a newly created sequence containing the sorted items.

take

Lazily returns a sequence of the first count elements.

window

Lazily slides and yields a window of length size over sequence.

zip_with_index

Add an index to each item in sequence (similar to Python's enumerate).

all()

Returns a standard python iterator that you can use to lazily iterate over your sequence of data

>>> seq = Sequence(range(10))
>>> seq = seq.map(lambda v: v*2)
>>> i = 0
>>> for item in seq.all():
...     i += item
>>> i
90

batch(size)

Lazily combines elements in sequence into lists of length size. This function will drop any remainder if the sequence ends before a full batch of size elements has been filled.

Parameters: size (int) – The batch size.

Examples

>>> seq = Sequence(range(10))
>>> seq.batch(3).collect()
[[0, 1, 2], [3, 4, 5], [6, 7, 8]]
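The drop-the-remainder behaviour can be sketched in plain Python. This only illustrates the semantics described above; it is not datapad's implementation, and the helper name batch_sketch is hypothetical.

```python
from itertools import islice


def batch_sketch(iterable, size):
    """Yield lists of exactly `size` items; an incomplete final batch is dropped."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if len(chunk) < size:
            return  # sequence ended before the batch was full: discard it
        yield chunk


# The trailing element 9 never forms a full batch of 3, so it is dropped.
batches = list(batch_sketch(range(10), 3))
print(batches)
```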

cache(overwrite=False)

Greedily stores results of self.collect() in an internal variable that can later be used to reset the Sequence to the beginning of the iterator. This is useful if you want to make multiple passes over the data. This function is meant to be used in conjunction with reset().

Parameters: overwrite (bool) – By default, repeated calls to cache() keep only the state of the iterator captured by the first call. If overwrite is True, the elements remaining in the iterator at its current position replace the contents of the internal cache.

>>> seq = Sequence([1,2,3])
>>> _ = seq.cache()
>>> seq.collect()
[1, 2, 3]
>>> _ = seq.reset()
>>> seq.collect()
[1, 2, 3]
>>> _ = seq.reset()
>>> seq.next()
1
>>> _ = seq.cache(overwrite=True)
>>> seq.collect()
[2, 3]
>>> _ = seq.reset()
>>> seq.collect()
[2, 3]
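The interaction between cache(), reset() and overwrite can be modelled with a small stand-in class. This is a sketch of the behaviour shown in the example above, under the assumption that the cache is a plain list; the class name CachedIter is hypothetical and this is not datapad's code.

```python
class CachedIter:
    """Hypothetical stand-in that mimics cache()/reset() on an iterator."""

    def __init__(self, iterable):
        self._it = iter(iterable)
        self._cache = None

    def cache(self, overwrite=False):
        # Only the first call stores anything, unless overwrite is requested.
        if self._cache is None or overwrite:
            self._cache = list(self._it)  # drains whatever is left
            self._it = iter(self._cache)
        return self

    def reset(self):
        if self._cache is None:
            raise RuntimeError("call cache() before reset()")
        self._it = iter(self._cache)
        return self

    def next(self):
        return next(self._it, None)

    def collect(self):
        return list(self._it)


seq = CachedIter([1, 2, 3]).cache()
first_pass = seq.collect()
second_pass = seq.reset().collect()

seq.reset()
head = seq.next()              # consumes 1
seq.cache(overwrite=True)      # cache now holds only the remaining items
after_overwrite = seq.reset().collect()
```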

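collect()

Eagerly returns all remaining elements in the sequence as a list. This consumes the underlying iterator, so a second call returns an empty list unless the sequence has been cached and reset.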

>>> seq = Sequence(range(10))
>>> seq.collect()
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> seq.collect()
[]

concat(seq)

Concatenates another sequence to the end of this sequence

Examples

Concat two sequences together:

>>> s1 = Sequence(['a', 'b', 'c'])
>>> s2 = Sequence(range(3))
>>> s3 = s2.concat(s1)
>>> s3.collect()
[0, 1, 2, 'a', 'b', 'c']

Concat sequence with itself:

>>> seq = Sequence(range(5))
>>> seq = seq.concat(seq)
>>> seq.collect()
[0, 1, 2, 3, 4, 0, 1, 2, 3, 4]

count(distinct=False)

Eagerly count number of elements in sequence

Parameters: distinct (bool) – If True, counts the occurrences of each distinct value in the sequence.
Returns: Either an integer count or, when distinct is True, a new sequence of tuples where the first value is the unique element and the second value is the number of times that element appeared in the sequence.

>>> seq = Sequence(range(5))
>>> seq.count()
5

>>> seq = Sequence(['a', 'a', 'b', 'b', 'c', 'c'])
>>> seq.count(distinct=True).collect()
[('a', 2), ('b', 2), ('c', 2)]
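For comparison, the same (element, count) pairs can be produced with the standard library. This is only a plain-Python analogue of the distinct=True output shown above, not a statement about how datapad computes it.

```python
from collections import Counter

items = ['a', 'a', 'b', 'b', 'c', 'c']

# Counter preserves first-seen order (Python 3.7+), so the pairs come out
# in the order the distinct values first appear.
pairs = list(Counter(items).items())
total = len(items)  # analogue of count() with distinct=False
print(pairs, total)
```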

distinct()

Eagerly returns a new sequence with unique values

>>> seq = Sequence(['a', 'a', 'b', 'b', 'c', 'c'])
>>> seq.distinct().collect()
['a', 'b', 'c']

drop(count)

Lazily skip or drop over count elements.

>>> seq = Sequence(range(5))
>>> seq.collect()
[0, 1, 2, 3, 4]

>>> seq = Sequence(range(5))
>>> seq = seq.drop(2)
>>> seq.collect()
[2, 3, 4]

drop_if(fn)

Lazily apply fn function to every element of iterable and drop sequence elements where the function fn evaluates to True.

Parameters: fn (function) – Function with signature fn(element) -> bool to apply to every element of sequence. Drop all elements in the sequence where the fn function evaluates to True.

>>> seq = Sequence(range(5))
>>> seq = seq.drop_if(lambda v: v > 1)
>>> seq.collect()
[0, 1]
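drop_if and keep_if are complements of each other. In standard-library terms they correspond to itertools.filterfalse and the built-in filter; the sketch below uses those directly and is not datapad's implementation.

```python
from itertools import filterfalse


def greater_than_one(v):
    return v > 1


# drop_if analogue: discard elements for which the predicate is True.
dropped = list(filterfalse(greater_than_one, range(5)))

# keep_if / filter analogue: keep elements for which the predicate is True.
kept = list(filter(greater_than_one, range(5)))

print(dropped, kept)
```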

dump(sink: Callable)

Dump sequence into a sink function that will greedily consume elements. The purpose of this function is mainly to be used as a way to push sequence elements to file writers or other external systems.

Parameters: sink (Callable) – Any callable that will take a sequence to consume as its first argument.

>>> import datapad as dp
>>> seq = dp.Sequence(range(3))
>>> sink = dp.io.JsonSink("data.jsonl", lines=True)
>>> seq.dump(sink)

filter(fn)

This is an alias for the Sequence.keep_if function

>>> seq = Sequence(range(5))
>>> seq = seq.filter(lambda v: v > 1)
>>> seq.collect()
[2, 3, 4]

first()

Eagerly returns the first remaining element in the sequence and advances the iterator, so repeated calls return successive elements.

Examples

Get first value in sequence:

>>> seq = Sequence(range(5))
>>> seq.first()
0
>>> seq.first()
1

Calling first on empty sequence returns None:

>>> seq = Sequence([])
>>> seq.first()

flatmap(fn)

Lazily apply fn function to every element of iterable and chain the output into a single flattened sequence.

Parameters: fn (function) – Function with signature fn(element) -> iterable(element) to apply to every element of sequence.

Examples

>>> seq = Sequence(range(5))
>>> seq = seq.flatmap(lambda v: [v,v])
>>> seq.collect()
[0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
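The map-then-flatten behaviour is equivalent to chaining the iterables returned by fn. A minimal plain-Python sketch, with the hypothetical helper name flatmap_sketch, looks like this:

```python
from itertools import chain


def flatmap_sketch(fn, iterable):
    """Lazily apply fn to each element and flatten one level of nesting."""
    return chain.from_iterable(map(fn, iterable))


flat = list(flatmap_sketch(lambda v: [v, v], range(5)))
print(flat)
```

Nothing is evaluated until the returned iterator is consumed, which mirrors the lazy behaviour described above.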

groupby(key=None, getter=None, eager_group=True)

Groups sequence using key function.

Note: you must ensure elements are sorted by their grouping key before calling this function.

Parameters:

key (function) – Function used to determine what to use as a key for grouping.
getter (function) – Function to be applied to each element of a group.
eager_group (bool, default=True) – If True, eagerly convert a group from a lazy Sequence to a fully-realized list.

Examples

Simple usage:

>>> from pprint import pprint
>>> seq = Sequence(['a', 'b', 'c', 'd', 'a', 'b', 'a', 'd'])
>>> res = seq.sort().groupby(key=lambda x: x).collect()
>>> res == [
...    ('a', ['a', 'a', 'a']),
...    ('b', ['b', 'b']),
...    ('c', ['c']),
...    ('d', ['d', 'd']),
... ]
True

Grouping with getter function:

>>> things = [("animal", "lion"),
...           ("plant", "maple tree"),
...           ("animal", "walrus"),
...           ("plant", "grass")]
>>> seq = Sequence(things)
>>> res = seq.sort().groupby(key=lambda x: x[0], getter=lambda x: x[1]).collect()
>>> res == [
...    ('animal', ['lion', 'walrus']),
...    ('plant', ['grass', 'maple tree'])
... ]
True
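The sorting requirement is easiest to see with itertools.groupby from the standard library, which only groups consecutive elements. The sketch below assumes datapad's groupby behaves the same way, as the note above suggests; it does not use datapad itself.

```python
from itertools import groupby

data = ['a', 'b', 'a']

# Without sorting, the two 'a' elements are not adjacent and end up
# in separate groups.
unsorted_groups = [(k, list(g)) for k, g in groupby(data)]

# Sorting first makes equal keys adjacent, giving one group per key.
sorted_groups = [(k, list(g)) for k, g in groupby(sorted(data))]

print(unsorted_groups)
print(sorted_groups)
```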

join(other, key=None, other_key=None)

Joins two sequences based on common field matches between the sequence and other. This is known as an “inner” join in SQL terminology.

Parameters:

other (Sequence) – A Sequence to join with the calling sequence.
key (function) – A function to retrieve the field to be used for matching between the two sequence. If key is None, then key will default to lambda x: x.
other_key (function) – A function to retrieve the field in other to be used for matching between the two sequence. If other_key is None, use key.

Returns:

A sequence of 2-tuples (a, b) where a is an element in self that matched element b in other (based on the given field keys).

Examples

>>> a = Sequence([
...     {'id': 1, 'name': 'John'},
...     {'id': 2, 'name': 'Nayeon'},
...     {'id': 3, 'name': 'Reza'}
... ])
>>> b = Sequence([
...     {'id': 1, 'age': 2},
...     {'id': 2, 'age': 3}
... ])
>>> res = a.join(b, key=lambda x: x['id']).collect()
>>> res == [
...     ({'id': 1, 'name': 'John'}, {'id': 1, 'age': 2}),
...     ({'id': 2, 'name': 'Nayeon'}, {'id': 2, 'age': 3})
... ]
True
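An inner join with the key and other_key defaults described above can be sketched with a dictionary index. This is one possible way to get the documented result, not necessarily how datapad does it; the function name inner_join_sketch is hypothetical.

```python
from collections import defaultdict


def inner_join_sketch(left, right, key=None, other_key=None):
    """Yield (a, b) pairs whose keys match; unmatched elements are dropped."""
    key = key or (lambda x: x)      # default: the element itself
    other_key = other_key or key    # default: reuse key for the other side

    # Index the right-hand side by key so each left element is a dict lookup.
    index = defaultdict(list)
    for b in right:
        index[other_key(b)].append(b)

    for a in left:
        for b in index.get(key(a), []):
            yield (a, b)


names = [{'id': 1, 'name': 'John'},
         {'id': 2, 'name': 'Nayeon'},
         {'id': 3, 'name': 'Reza'}]
ages = [{'id': 1, 'age': 2}, {'id': 2, 'age': 3}]

joined = list(inner_join_sketch(names, ages, key=lambda x: x['id']))
print(joined)  # id 3 has no match in ages, so it does not appear
```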

keep_if(fn)

Lazily apply fn function to every element of iterable and keep only sequence elements where the function fn evaluates to True.

Parameters: fn (function) – Function with signature fn(element) -> bool to apply to every element of sequence. Keep all elements in the sequence where the fn function evaluates to True.

>>> seq = Sequence(range(5))
>>> seq = seq.keep_if(lambda v: v > 1)
>>> seq.collect()
[2, 3, 4]

map(fn)

Lazily apply fn function to every element of iterable

Parameters: fn (function) – Function with signature fn(element) to apply to every element of sequence.

>>> seq = Sequence(range(10))
>>> seq = seq.map(lambda v: v*2)
>>> seq.collect()
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

next()

Eagerly returns next element in sequence (alias for first() function)

Examples

Get next value in sequence:

>>> seq = Sequence(range(5))
>>> seq.next()
0
>>> seq.next()
1

Calling next on empty sequence returns None:

>>> seq = Sequence([])
>>> seq.next()

peek(count=None)

Returns list of count elements without advancing sequence iterator. If count is None, return only the first element.

WARNING: this function will load up to count elements of your sequence into memory.

Examples

Peek at first element (notice iterator does not advance):

>>> seq = Sequence(range(10))
>>> seq.peek()
0
>>> seq.peek()
0

Peek at first 3 elements:

>>> seq = Sequence(range(10))
>>> seq.peek(3)
[0, 1, 2]
>>> seq.peek(3)
[0, 1, 2]
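Peeking without advancing can be done by pulling elements off the iterator and then chaining them back in front of it. The sketch below shows that idea in plain Python; the class name Peekable is hypothetical, returning None when peeking at an empty sequence is an assumption, and none of this is datapad's own code.

```python
from itertools import chain, islice


class Peekable:
    """Hypothetical wrapper that supports look-ahead on any iterable."""

    def __init__(self, iterable):
        self._it = iter(iterable)

    def peek(self, count=None):
        n = 1 if count is None else count
        head = list(islice(self._it, n))   # these items are held in memory
        self._it = chain(head, self._it)   # put them back in front
        if count is None:
            return head[0] if head else None
        return head

    def collect(self):
        return list(self._it)


p = Peekable(range(10))
first = p.peek()
again = p.peek()
head3 = p.peek(3)
everything = p.collect()  # nothing was lost by peeking
```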

pipe(pipe: Callable[[Sequence], Sequence | Iterable]) → Sequence

Pass sequence to a function that will iterate over each element in the sequence and return another sequence. This is often used to abstract out complicated sequence processing pipelines into sub-units or to re-use previously defined python iteration functions.

Parameters: pipe (Callable) – A function that takes the sequence as its argument. The function MUST return another Sequence or an iterable.

>>> def p1(seq):
...     for elem in seq:
...         yield elem*2
>>> s = Sequence([1,2,3,4,5]).pipe(p1).collect()
>>> s
[2, 4, 6, 8, 10]

>>> def p2(seq):
...     return seq.flatmap(lambda v: v).map(lambda v: v*2).batch(2)
>>> s = Sequence([[1, 2, 3, 4, 5, 6]]).pipe(p2).collect()
>>> s
[[2, 4], [6, 8], [10, 12]]

pmap(fn, workers=3, ordered=True, wtype='thread')

Lazily apply fn function to every element of iterable, in parallel using python’s multiprocessing package. The returned sequence may appear in a different order than the input sequence if you set ordered to False.

Parameters:

fn (function) – Function with signature fn(element) -> element to apply to every element of sequence.
workers (int) – Number of parallel workers to use (default: 3). Workers are Python threads unless wtype is set to “process”.
ordered (bool) – Whether to yield results in the same order as the input elements. Setting this to False may improve performance (default: True).
wtype ("thread" | "process") – The worker type to use (default: “thread”). If you use “process”, any function passed to pmap must be a named function defined in your script; lambda functions cannot be used because of the limitations of Python multiprocessing pickling. The worker type “thread” is well suited to I/O-bound tasks, such as reading data from URLs.

>>> seq = Sequence(range(10))
>>> seq = seq.pmap(lambda v: v*2)
>>> seq.collect()
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

>>> seq = Sequence(range(10))
>>> seq = seq.pmap(lambda v: v*2, workers=1, ordered=False)
>>> seq.collect()
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
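The ordered versus unordered trade-off can be illustrated with the standard library's concurrent.futures. This is a sketch of the general technique with thread workers only; it is eager rather than lazy, it is not datapad's implementation, and the helper name pmap_threads_sketch is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def double(v):
    # A named function like this is also what a process-based worker
    # would need, since lambdas cannot be pickled.
    return v * 2


def pmap_threads_sketch(fn, items, workers=3, ordered=True):
    """Apply fn to every item using a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        if ordered:
            # map() yields results in input order, waiting on slow items.
            return list(pool.map(fn, items))
        # as_completed() yields each result as soon as it is ready,
        # so the output order is not guaranteed.
        futures = [pool.submit(fn, item) for item in items]
        return [f.result() for f in as_completed(futures)]


ordered_result = pmap_threads_sketch(double, range(10))
unordered_result = pmap_threads_sketch(double, range(10), ordered=False)
```

With ordered=False the same values come back, but only their sorted form is predictable.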

progress()

Output progress and transparently pass through elements of sequence.

>>> seq = Sequence(range(1000))
>>> _ = seq.map(lambda x: x*2).progress().collect()

reduce(fn, initial=None)

Eagerly apply a function of two arguments cumulatively to the items of a sequence, from left to right, so as to reduce the sequence to a single value. For example, reduce(lambda x, y: x+y, [1, 2, 3, 4, 5]) calculates ((((1+2)+3)+4)+5). If initial is present, it is placed before the items of the sequence in the calculation, and serves as a default when the sequence is empty.

Parameters:

fn (function) – Function with signature fn(acc, current_item) -> acc_next
initial (Any) – An initial value that acc will be set to. If not provided, this function will set the first element of the sequence as the initial value.

Examples

Reduce with accumulator initialized to first element:

>>> seq = Sequence(range(3))
>>> seq.reduce(lambda acc, item: acc + item)
3

Reduce with accumulator set to a custom initial value:

>>> seq = Sequence(range(3))
>>> seq.reduce(lambda acc, item: acc + item, initial=10)
13
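The description above matches the behaviour of functools.reduce in the standard library, which makes the role of the initial value easy to check in isolation. The sketch below uses functools.reduce directly rather than datapad.

```python
from functools import reduce


def add(acc, item):
    return acc + item


# No initial value: the first element (0) seeds the accumulator, ((0+1)+2).
total = reduce(add, range(3))

# With an initial value: (((10+0)+1)+2).
total_with_initial = reduce(add, range(3), 10)

# On an empty input the initial value is returned unchanged.
empty_default = reduce(add, [], 10)

print(total, total_with_initial, empty_default)
```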

reset()

Uses the internal cache of the Sequence to reset to beginning of iterator

>>> seq = Sequence([1, 2, 3])
>>> _ = seq.cache()
>>> seq.collect()
[1, 2, 3]
>>> _  = seq.reset()
>>> seq.collect()
[1, 2, 3]

shuffle()

Eagerly shuffles your sequence and returns a newly created sequence containing the shuffled items. WARNING: this function loads the entirety of your sequence into memory.

>>> import random
>>> random.seed(0)
>>> seq = Sequence(range(5))
>>> seq.shuffle().collect()
[2, 1, 0, 4, 3]

sort(key: Callable | None = None, reverse=False)

Eagerly sorts your sequence and returns a newly created sequence containing the sorted items. WARNING: this function loads the entirety of your sequence into memory.

>>> seq = Sequence([2, 1, 0, 4, 3])
>>> seq.sort().collect()
[0, 1, 2, 3, 4]

take(count)

Lazily returns a sequence of the first count elements.

>>> seq = Sequence(range(5))
>>> seq.take(2).collect()
[0, 1]

window(size, stride=1)

Lazily slides and yields a window of length size over sequence. This function will drop any remainder if the sequence ends before a full window of size elements has been filled.

Parameters:

size (int) – The window size.
stride (int) – How many elements to skip for each advancement in window position.

Examples

>>> seq = Sequence(range(10))
>>> seq.window(2).collect()
[[0, 1], [1, 2], [2, 3], [3, 4], [4, 5], [5, 6], [6, 7], [7, 8], [8, 9]]

>>> seq = Sequence(range(10))
>>> seq.window(3, stride=2).collect()
[[0, 1, 2], [2, 3, 4], [4, 5, 6], [6, 7, 8]]

>>> seq = Sequence(range(10))
>>> seq.window(2, stride=4).collect()
[[0, 1], [4, 5], [8, 9]]

>>> seq = Sequence(range(10))
>>> seq.window(1, stride=1).collect()
[[0], [1], [2], [3], [4], [5], [6], [7], [8], [9]]
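How size and stride interact, including strides larger than the window, can be sketched with a small generator. This reproduces the outputs shown above in plain Python; it is an illustration of the semantics rather than datapad's implementation, and the name window_sketch is hypothetical.

```python
def window_sketch(iterable, size, stride=1):
    """Yield lists of `size` items, advancing `stride` items between windows."""
    if size < 1 or stride < 1:
        raise ValueError("size and stride must be positive integers")
    buf = []
    skip = 0  # items still to discard when stride > size
    for item in iterable:
        if skip > 0:
            skip -= 1
            continue
        buf.append(item)
        if len(buf) == size:
            yield list(buf)
            if stride >= size:
                # Windows do not overlap: start fresh and skip the gap.
                skip = stride - size
                buf = []
            else:
                # Windows overlap: keep the last size - stride items.
                buf = buf[stride:]
    # Any partially filled window left in buf is dropped.


overlapping = list(window_sketch(range(10), 3, stride=2))
gapped = list(window_sketch(range(10), 2, stride=4))
print(overlapping)
print(gapped)
```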

zip_with_index()

Add an index to each item in sequence (similar to Python's enumerate).

>>> seq = Sequence(['a', 'b', 'c'])
>>> seq.zip_with_index().collect()
[(0, 'a'), (1, 'b'), (2, 'c')]
