Save / load a scipy sparse csr_matrix in a portable data format

Question 1

How do you save / load a scipy sparse csr_matrix in a portable format? The scipy sparse matrix is created on Python 3 (Windows 64-bit) to run on Python 2 (Linux 64-bit). Initially I used pickle (with protocol=2 and fix_imports=True), but this did not work going from Python 3.2.2 (Windows 64-bit) to Python 2.7.2 (Windows 32-bit), and I got the error:

TypeError: ('data type not understood', <built-in function _reconstruct>, (<type 'numpy.ndarray'>, (0,), '[98]')).

Next, I tried numpy.save and numpy.load as well as scipy.io.mmwrite() and scipy.io.mmread(), and none of these methods worked either.

Question 2

edit: SciPy 0.19 now has scipy.sparse.save_npz and scipy.sparse.load_npz.

from scipy import sparse

sparse.save_npz("yourmatrix.npz", your_matrix)
your_matrix_back = sparse.load_npz("yourmatrix.npz")

For both functions, the file argument may also be a file-like object (i.e. the result of open) instead of a filename.
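To illustrate the file-like case, here is a minimal sketch that round-trips a matrix through an in-memory io.BytesIO buffer instead of a file on disk. The small matrix and the variable names are made up for the example.

```python
import io

import numpy as np
from scipy import sparse

# Hypothetical small matrix, just for the demonstration
your_matrix = sparse.csr_matrix(np.array([[0, 0, 3], [4, 0, 0]]))

buf = io.BytesIO()
sparse.save_npz(buf, your_matrix)   # write into the buffer instead of a named file
buf.seek(0)                         # rewind before reading
your_matrix_back = sparse.load_npz(buf)

print(your_matrix_back.format)
print(your_matrix_back.toarray())
```

The same pattern should apply to a file opened with open(..., 'wb') / open(..., 'rb').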

Got an answer from the Scipy user group:

A csr_matrix has 3 data attributes that matter: .data, .indices, and .indptr. All are simple ndarrays, so numpy.save will work on them. Save the three arrays with numpy.save or numpy.savez, load them back with numpy.load, and then recreate the sparse matrix object with:
new_csr = csr_matrix((data, indices, indptr), shape=(M, N))

So for example:

import numpy as np
from scipy.sparse import csr_matrix

def save_sparse_csr(filename, array):
    np.savez(filename, data=array.data, indices=array.indices,
             indptr=array.indptr, shape=array.shape)

def load_sparse_csr(filename):
    loader = np.load(filename)
    return csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                      shape=loader['shape'])
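To make the three arrays concrete, here is a small sketch that prints them for a hand-made 3x3 matrix and rebuilds the matrix after a round trip through np.savez. The example matrix and the in-memory buffer are my own choices for illustration, not part of the mailing-list answer.

```python
import io

import numpy as np
from scipy.sparse import csr_matrix

m = csr_matrix(np.array([[0, 0, 3],
                         [4, 0, 0],
                         [0, 5, 6]]))

print(m.data)     # stored non-zero values, row by row
print(m.indices)  # column index of each stored value
print(m.indptr)   # row i occupies data[indptr[i]:indptr[i + 1]]

# Round trip through np.savez, using an in-memory buffer instead of a file
buf = io.BytesIO()
np.savez(buf, data=m.data, indices=m.indices, indptr=m.indptr, shape=m.shape)
buf.seek(0)
loader = np.load(buf)
restored = csr_matrix((loader["data"], loader["indices"], loader["indptr"]),
                      shape=tuple(loader["shape"]))
print(restored.toarray())
```

Only plain numeric arrays end up in the archive, so no pickling is involved.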

Question 3

Although you write that scipy.io.mmwrite and scipy.io.mmread don't work for you, I just want to add how they work. This question is the no. 1 Google hit, so I myself started with np.savez and pickle.dump before switching to the simple and obvious scipy functions. They work for me and shouldn't be overlooked by those who haven't tried them yet.

from scipy import sparse, io

m = sparse.csr_matrix([[0,0,0],[1,0,0],[0,1,0]])
m              # <3x3 sparse matrix of type '<type 'numpy.int64'>' with 2 stored elements in Compressed Sparse Row format>

io.mmwrite("test.mtx", m)
del m

newm = io.mmread("test.mtx")
newm           # <3x3 sparse matrix of type '<type 'numpy.int32'>' with 2 stored elements in COOrdinate format>
newm.tocsr()   # <3x3 sparse matrix of type '<type 'numpy.int32'>' with 2 stored elements in Compressed Sparse Row format>
newm.toarray() # array([[0, 0, 0], [1, 0, 0], [0, 1, 0]], dtype=int32)

Question 4

Here is a performance comparison of the three most upvoted answers using a Jupyter notebook. The input is a 1M x 100K random sparse matrix with density 0.001, containing 100M non-zero values:

from scipy.sparse import random
matrix = random(1000000, 100000, density=0.001, format='csr')

matrix
<1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
with 100000000 stored elements in Compressed Sparse Row format>

`io.mmwrite` / `io.mmread`

from scipy import io

%time io.mmwrite('test_io.mtx', matrix)
CPU times: user 4min 37s, sys: 2.37 s, total: 4min 39s
Wall time: 4min 39s

%time matrix = io.mmread('test_io.mtx')
CPU times: user 2min 41s, sys: 1.63 s, total: 2min 43s
Wall time: 2min 43s    

matrix
<1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
with 100000000 stored elements in COOrdinate format>    

Filesize: 3.0G.

(note that the format has changed from csr to coo).

`np.savez` / `np.load`

import numpy as np
from scipy.sparse import csr_matrix

def save_sparse_csr(filename, array):
    # note that .npz extension is added automatically
    np.savez(filename, data=array.data, indices=array.indices,
             indptr=array.indptr, shape=array.shape)

def load_sparse_csr(filename):
    # here we need to add .npz extension manually
    loader = np.load(filename + '.npz')
    return csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                      shape=loader['shape'])


%time save_sparse_csr('test_savez', matrix)
CPU times: user 1.26 s, sys: 1.48 s, total: 2.74 s
Wall time: 2.74 s    

%time matrix = load_sparse_csr('test_savez')
CPU times: user 1.18 s, sys: 548 ms, total: 1.73 s
Wall time: 1.73 s

matrix
<1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
with 100000000 stored elements in Compressed Sparse Row format>

Filesize: 1.1G.

`cPickle`

import cPickle as pickle

def save_pickle(matrix, filename):
    with open(filename, 'wb') as outfile:
        pickle.dump(matrix, outfile, pickle.HIGHEST_PROTOCOL)
def load_pickle(filename):
    with open(filename, 'rb') as infile:
        matrix = pickle.load(infile)    
    return matrix    

%time save_pickle(matrix, 'test_pickle.mtx')
CPU times: user 260 ms, sys: 888 ms, total: 1.15 s
Wall time: 1.15 s    

%time matrix = load_pickle('test_pickle.mtx')
CPU times: user 376 ms, sys: 988 ms, total: 1.36 s
Wall time: 1.37 s    

matrix
<1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
with 100000000 stored elements in Compressed Sparse Row format>

Filesize: 1.1G.

Note: cPickle does not work with very large objects (see this answer). In my experience, it didn't work for a 2.7M x 50k matrix with 270M non-zero values. The np.savez solution worked well.
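For readers on Python 3: pickle protocol 4 (available since Python 3.4) added support for very large objects, so the size limit above may not apply there. I have not rerun the 270M case, so treat the following as a sketch of the call only, on a deliberately small matrix.

```python
import pickle

from scipy.sparse import random as sparse_random

# Small stand-in matrix; the sizes and seed are arbitrary choices for the example
m = sparse_random(1000, 500, density=0.01, format="csr", random_state=0)

payload = pickle.dumps(m, protocol=4)   # protocol 4 requires Python 3.4+
restored = pickle.loads(payload)

print(restored.format, restored.shape, restored.nnz)
```

A protocol 4 pickle cannot be read by Python 2, so this does not help with the cross-version case in the question.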

Conclusion

(based on this simple test for CSR matrices) cPickle is the fastest method, but it doesn't work with very large matrices. np.savez is only slightly slower, while io.mmwrite is much slower, produces a bigger file, and restores to the wrong format. So np.savez is the winner here.

Question 5

You can now use scipy.sparse.save_npz: https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.save_npz.html

Question 6

Assuming you have scipy on both machines, you can just use pickle.

However, be sure to specify a binary protocol when pickling numpy arrays. Otherwise you'll wind up with a huge file.

At any rate, you should be able to do this:

import cPickle as pickle
import numpy as np
import scipy.sparse

# Just for testing, let's make a dense array and convert it to a csr_matrix
x = np.random.random((10,10))
x = scipy.sparse.csr_matrix(x)

with open('test_sparse_array.dat', 'wb') as outfile:
    pickle.dump(x, outfile, pickle.HIGHEST_PROTOCOL)

You can then load it with:

import cPickle as pickle

with open('test_sparse_array.dat', 'rb') as infile:
    x = pickle.load(infile)

Question 7

As of scipy 0.19.0, you can save and load sparse matrices this way:

from scipy import sparse

data = sparse.csr_matrix((3, 4))

#Save
sparse.save_npz('data_sparse.npz', data)

#Load
data = sparse.load_npz("data_sparse.npz")
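As far as I can tell, save_npz / load_npz also remember which sparse format was saved, which is not the case for the mmwrite route. Here is a rough check through a temporary directory; the file names and the tiny matrix are made up for the example.

```python
import os
import tempfile

import numpy as np
from scipy import sparse

dense = np.array([[0, 0, 1.5],
                  [2.0, 0, 0]])

formats_seen = {}
with tempfile.TemporaryDirectory() as tmpdir:
    for fmt in ("csr", "csc", "coo"):
        path = os.path.join(tmpdir, fmt + ".npz")
        original = sparse.csr_matrix(dense).asformat(fmt)
        sparse.save_npz(path, original)
        restored = sparse.load_npz(path)
        # record which format came back and check the values survived
        formats_seen[fmt] = restored.format
        assert np.array_equal(restored.toarray(), dense)

print(formats_seen)
```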

Question 8

EDIT Apparently it is simple enough to do this:

def sparse_matrix_tuples(m):
    yield from m.todok().items()

This yields ((i, j), value) tuples, which are easy to serialize and deserialize. I'm not sure how it compares performance-wise with the csr_matrix code below, but it's definitely simpler. I'm leaving the original answer below as I hope it's informative.
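As a sketch of what "serialize and deserialize" could look like, the tuples can be dumped to JSON and turned back into a matrix with coo_matrix. The JSON layout (a "shape" key plus a "triplets" list) is my own invention for the example; the shape has to be stored separately because the tuples alone do not carry it.

```python
import json

from scipy.sparse import coo_matrix, csr_matrix

def sparse_matrix_tuples(m):
    yield from m.todok().items()

m = csr_matrix([[0, 0, 3],
                [4, 0, 0],
                [0, 5, 0]])

# Convert to plain Python ints so json can handle them
triplets = [[int(i), int(j), int(v)] for (i, j), v in sparse_matrix_tuples(m)]
payload = json.dumps({"shape": list(m.shape), "triplets": triplets})

# Any JSON-capable client can read the payload; here it is rebuilt in Python
decoded = json.loads(payload)
rows, cols, vals = zip(*decoded["triplets"])
restored = coo_matrix((vals, (rows, cols)), shape=tuple(decoded["shape"])).tocsr()
print(restored.toarray())
```

For a float matrix, int(v) would become float(v).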

Adding my two cents: for me, npz is not portable, as I can't use it to easily export my matrix to non-Python clients (e.g. PostgreSQL, glad to be corrected). So I would have liked to get CSV output for the sparse matrix (much like what you get when you print() the sparse matrix). How to achieve this depends on the representation of the sparse matrix. For a CSR matrix, the following code produces CSV output. You can adapt it for other representations.

import numpy as np

def csr_matrix_tuples(m):
    # visit only rows where indptr increases, so empty rows cost nothing
    for i in np.flatnonzero(np.diff(m.indptr)):
        start_index, end_index = m.indptr[i], m.indptr[i + 1]
        for j, data in zip(m.indices[start_index:end_index], m.data[start_index:end_index]):
            yield (i, j, data)

for i, j, data in csr_matrix_tuples(my_csr_matrix):
    print(i, j, data, sep=',')

From what I've tested, this is about 2 times slower than save_npz in the current implementation.

Question 9

This is what I used to save a lil_matrix.

import numpy as np
from scipy.sparse import lil_matrix

def save_sparse_lil(filename, array):
    # use np.savez_compressed(..) for compression
    np.savez(filename, dtype=array.dtype.str, data=array.data,
        rows=array.rows, shape=array.shape)

def load_sparse_lil(filename):
    loader = np.load(filename)
    result = lil_matrix(tuple(loader["shape"]), dtype=str(loader["dtype"]))
    result.data = loader["data"]
    result.rows = loader["rows"]
    return result

I must say I found NumPy's np.load(..) to be very slow. This is my current solution, which I feel runs much faster:

from scipy.sparse import lil_matrix
import numpy as np
import json

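def lil_matrix_to_dict(myarray):
    # .tolist() turns the object arrays of per-row lists into nested lists json can write
    result = {
        "dtype": myarray.dtype.str,
        "shape": myarray.shape,
        "data":  myarray.data.tolist(),
        "rows":  myarray.rows.tolist()
    }
    return result

def lil_matrix_from_dict(mydict):
    result = lil_matrix(tuple(mydict["shape"]), dtype=mydict["dtype"])
    # assign row by row so every entry stays a Python list inside the object array
    for i, (row, data) in enumerate(zip(mydict["rows"], mydict["data"])):
        result.rows[i] = row
        result.data[i] = data
    return result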

def load_lil_matrix(filename):
    result = None
    with open(filename, "r", encoding="utf-8") as infile:
        mydict = json.load(infile)
        result = lil_matrix_from_dict(mydict)
    return result

def save_lil_matrix(filename, myarray):
    with open(filename, "w", encoding="utf-8") as outfile:
        mydict = lil_matrix_to_dict(myarray)
        json.dump(mydict, outfile)

Question 10

This works for me:

import numpy as np
import scipy.sparse as sp
x = sp.csr_matrix([1,2,3])
y = sp.csr_matrix([2,3,4])
np.savez(file, x=x, y=y)
npz = np.load(file)

>>> npz['x'].tolist()
<1x3 sparse matrix of type '<class 'numpy.int64'>'
    with 3 stored elements in Compressed Sparse Row format>

>>> npz['x'].tolist().toarray()
array([[1, 2, 3]], dtype=int64)

The trick was to call .tolist() to convert the shape 0 object array back into the original object.
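To show what happens under the hood, here is a sketch of the same trick with an in-memory buffer. Note that np.savez stores the sparse matrix by pickling it inside a 0-d object array, so recent NumPy versions need allow_pickle=True on load (that flag is my addition, it is not in the snippet above), and the usual pickle portability caveats apply.

```python
import io

import numpy as np
import scipy.sparse as sp

x = sp.csr_matrix([1, 2, 3])

buf = io.BytesIO()
np.savez(buf, x=x)                      # x is wrapped in a 0-d object array and pickled
buf.seek(0)
npz = np.load(buf, allow_pickle=True)   # object arrays are refused without this flag

wrapped = npz["x"]
print(wrapped.shape, wrapped.dtype)     # a 0-d array of dtype object

restored = wrapped.item()               # same effect as .tolist() on a 0-d array
print(restored.toarray())
```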

Question 11

I was asked to send the matrix in a simple, generic format:

<x,y,value>

I ended up with this:

import numpy as np

def save_sparse_matrix(m, filename):
    # one "row,column,value" line per non-zero entry
    nonZeros = np.array(m.nonzero())
    with open(filename, 'w') as thefile:
        for entry in range(nonZeros.shape[1]):
            thefile.write("%s,%s,%s\n" % (nonZeros[0, entry], nonZeros[1, entry], m[nonZeros[0, entry], nonZeros[1, entry]]))
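For completeness, a possible way to read such an x,y,value file back into a sparse matrix. This is only a sketch: the sample text stands in for a real file, and the shape must be supplied by the caller because the format above does not record it.

```python
import io

import numpy as np
from scipy.sparse import coo_matrix

# Stand-in for the contents of a file written in the x,y,value format
text = "0,2,3.0\n1,0,4.0\n2,1,5.0\n"

# Each line becomes one (row, column, value) triplet
rows, cols, vals = np.loadtxt(io.StringIO(text), delimiter=",", unpack=True, ndmin=2)

shape = (3, 3)  # assumption: known out of band, the file does not store it
restored = coo_matrix((vals, (rows.astype(int), cols.astype(int))), shape=shape).tocsr()
print(restored.toarray())
```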
