Reference

analyzrclient.analyzer module

class analyzrclient.analyzer.Analyzer(host=None, verbose=False)

Bases: object

Parent class for Analyzr client. This is the class that should be instantiated by the client. For detailed methods, see the appropriate runner class.

Parameters:

host (str, required) – the FQDN for your API tenant
verbose (boolean, optional) – Set to true for verbose output

client_version()

Provide client version info

Returns:

JSON object with API version and other metadata

login(verbose=False)

Log in to Analyzr API

Parameters:

verbose (boolean, optional) – Set to True for verbose screen output

Return type: None

logout(verbose=False)

Log out of Analyzr API

Parameters:

verbose (boolean, optional) – Set to True for verbose screen output

Return type: None
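
A minimal sketch of pairing login() with logout(), assuming only the two signatures documented above. The helper name analyzr_session and the FakeAnalyzer stand-in are hypothetical and not part of analyzrclient; the helper accepts any object that exposes login and logout, so it can be tried without a network call before it is used with a real Analyzer instance.

```python
from contextlib import contextmanager


@contextmanager
def analyzr_session(analyzer, verbose=False):
    """Log in on entry and always log out on exit.

    `analyzer` is expected to expose login(verbose=...) and
    logout(verbose=...) as documented above. This helper is
    hypothetical and is not shipped with analyzrclient.
    """
    analyzer.login(verbose=verbose)
    try:
        yield analyzer
    finally:
        # Runs even if the body raises, so the session is not left open.
        analyzer.logout(verbose=verbose)


class FakeAnalyzer:
    """Stand-in used only to exercise the helper offline."""

    def __init__(self):
        self.calls = []

    def login(self, verbose=False):
        self.calls.append("login")

    def logout(self, verbose=False):
        self.calls.append("logout")


if __name__ == "__main__":
    fake = FakeAnalyzer()
    with analyzr_session(fake) as session:
        session.calls.append("work")
    print(fake.calls)
```

With a real client, the same with-block would wrap an Analyzer object created with the tenant host.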

version()

Provide API version info

Returns:

JSON object with API version and other metadata

analyzrclient.runner_cluster module

class analyzrclient.runner_cluster.ClusterRunner(client=None, base_url=None)

Bases: BaseRunner

Run the clustering pipeline

Parameters:

client (SamlSsoAuthClient, required) – SAML SSO client object
base_url (str, required) – Base URL for the Analyzr API tenant

check_status(request_id=None, client_id=None, verbose=False, data=None)

Check the status of a specific model run. Data is homomorphically encoded by default

Parameters:

request_id (str, required) – UUID for a specific model object
client_id (string, required) – Short name for account being used. Used for reporting purposes only
verbose (boolean, optional) – Set to true for verbose output
data (DataFrame, optional) – if data is not None, cluster IDs will be appended and stats compiled

Returns:

JSON object with the following attributes: status (can be Pending, Complete, or Failed), request_id (UUID provided with initial request), data (dataframe with clustering results, if applicable)

predict(df, model_id=None, client_id=None, idx_var=None, categorical_vars=[], numerical_vars=[], buffer_batch_size=1000, cluster_batch_size=None, timeout=600, verbose=False, compressed=False, staging=True)

Assign cluster IDs to dataset using a pre-trained model

Parameters:

df (DataFrame, required) – dataframe containing dataset to be clustered. The data is homomorphically encrypted by the client prior to being transferred to the API buffer
model_id (str, required) – UUID for a specific model object
client_id (string, required) – Short name for account being used. Used for reporting purposes only
idx_var (string, required) – name of index field identifying unique record IDs in df for audit purposes
categorical_vars (string[], required) – array of field names identifying categorical fields in the dataframe df
numerical_vars (string[], required) – array of field names identifying numerical fields in the dataframe df
buffer_batch_size (int, optional) – batch size for the purpose of uploading data from the client to the server’s buffer
cluster_batch_size (int, optional) – batch size for the purpose of clustering the data provided in the dataframe df
timeout (int, optional) – client will keep polling API for a period of timeout seconds
verbose (boolean, optional) – Set to true for verbose output
compressed (boolean, optional) – perform additional compression when uploading data to buffer
staging (boolean, optional) – when set to True the API will use temporary secure cloud storage to buffer the data rather than a relational database (default is True)

Returns:

JSON object with the following attributes: model_id (UUID provided with initial request), data2 (original dataset with cluster IDs appended)
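
To make buffer_batch_size concrete, the sketch below shows how a list of records could be split into upload batches of at most that size. It illustrates the arithmetic only: the function names iter_batches and batch_count are hypothetical, and how the client actually chunks and uploads data is not described in this reference.

```python
import math


def iter_batches(records, batch_size=1000):
    """Yield consecutive slices of `records`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


def batch_count(n_records, batch_size=1000):
    """Number of batches needed to cover n_records rows."""
    return math.ceil(n_records / batch_size)


if __name__ == "__main__":
    rows = list(range(2500))  # placeholder records, not real data
    print([len(batch) for batch in iter_batches(rows, batch_size=1000)])
    print(batch_count(len(rows), batch_size=1000))
```

Under this scheme, 2,500 records with the default batch size of 1000 give two full batches and one partial batch of 500.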

run(df, client_id=None, idx_var=None, categorical_vars=[], numerical_vars=[], algorithm='pca-kmeans', n_components=5, buffer_batch_size=1000, cluster_batch_size=None, verbose=False, poll=True, compressed=False, staging=True)

Run clustering algorithm on user-provided dataset

Parameters:

df (DataFrame, required) – dataframe containing dataset to be clustered. The data is homomorphically encrypted by the client prior to being transferred to the API buffer
client_id (string, required) – Short name for account being used. Used for reporting purposes only
idx_var (string, required) – name of index field identifying unique record IDs in df for audit purposes
categorical_vars (string[], required) – array of field names identifying categorical fields in the dataframe df
numerical_vars (string[], required) – array of field names identifying numerical fields in the dataframe df
algorithm (string, required) – can be any of the following: pca-kmeans, incremental-pca-kmeans, pca-kmeans-simple, kmeans, minibatch-kmeans, gaussian-mixture, birch, dbscan, optics, mean-shift, spectral-clustering, hierarchical-agglomerative. Algorithms are sourced from Scikit-Learn unless otherwise indicated.
n_components (int, optional) – number of clustering components
buffer_batch_size (int, optional) – batch size for the purpose of uploading data from the client to the server’s buffer
cluster_batch_size (int, optional) – batch size for the purpose of clustering the data provided in the dataframe df
verbose (boolean, optional) – Set to true for verbose output
poll (boolean, optional) – keep polling API while the job is being run (default is True)
compressed (boolean, optional) – perform additional compression when uploading data to buffer
staging (boolean, optional) – when set to True the API will use temporary secure cloud storage to buffer the data rather than a relational database (default is True)

Returns:

JSON object with the following attributes: model_id (UUID provided with initial request), request_id (same as model_id, provided for backward compatibility), data (original dataset with cluster IDs appended), distances (distance matrix showing inter-cluster distances, centroid to centroid), stats (count, frequency, and attribute averages by cluster ID)
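
The count and frequency figures mentioned for stats can be illustrated with a short, standalone sketch. It works on a plain list of cluster IDs; the function name cluster_counts and the shape of its output are assumptions made for illustration, and the layout of the stats attribute returned by the API may differ.

```python
from collections import Counter


def cluster_counts(cluster_ids):
    """Return {cluster_id: (count, frequency)} for a list of cluster IDs."""
    total = len(cluster_ids)
    counts = Counter(cluster_ids)
    # Frequency is the share of all records assigned to each cluster.
    return {cid: (n, n / total) for cid, n in sorted(counts.items())}


if __name__ == "__main__":
    assigned = [0, 1, 1, 2, 1, 0]  # made-up cluster assignments
    for cid, (count, freq) in cluster_counts(assigned).items():
        print(cid, count, round(freq, 3))
```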

analyzrclient.runner_propensity module

class analyzrclient.runner_propensity.PropensityRunner(client=None, base_url=None)

Bases: BaseRunner

Run the propensity scoring pipeline

Parameters:

client (SamlSsoAuthClient, required) – SAML SSO client object
base_url (str, required) – Base URL for the Analyzr API tenant

check_status(model_id=None, client_id=None, verbose=False, encoding=True)

Check the status of a specific model run, and retrieve results if model run is complete

Parameters:

model_id (str, required) – UUID for a specific model object
client_id (string, required) – Short name for account being used. Used for reporting purposes only
verbose (boolean, optional) – Set to true for verbose output
encoding (boolean, optional) – Decode results with homomorphic encryption

Returns:

JSON object with the following attributes, as applicable: status (can be Pending, Complete, or Failed), features (table of feature importances), confusion_matrix (confusion matrix using test dataset), stats (error stats including accuracy, precision, recall, F1, AUC, Gini), roc (receiver operating characteristic curve)
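
For readers who want to relate the listed stats to the confusion matrix, the sketch below computes accuracy, precision, recall, and F1 from the four cell counts, and derives Gini from AUC with the usual relation Gini = 2 * AUC - 1. The function names are hypothetical, and the cell layout of the confusion_matrix attribute returned by the API is not documented here, so the counts are passed in explicitly.

```python
def classification_stats(tp, fp, fn, tn):
    """Standard metrics from true/false positive/negative counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def gini_from_auc(auc):
    """Gini coefficient implied by an area under the ROC curve."""
    return 2 * auc - 1


if __name__ == "__main__":
    # Made-up counts for illustration only.
    print(classification_stats(tp=40, fp=10, fn=20, tn=30))
    print(gini_from_auc(0.75))
```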

predict(df, model_id=None, client_id=None, idx_var=None, categorical_vars=[], numerical_vars=[], bool_vars=[], buffer_batch_size=1000, api_batch_size=2000, verbose=False, timeout=600, step=2, compressed=False, staging=True, encoding=True)

Predict probabilities of outcome (propensities) for user-specified dataset using a pre-trained model

Parameters:

df (DataFrame, required) – Dataframe containing dataset to be scored. The data is homomorphically encrypted by the client prior to being transferred to the API buffer when encoding is set to True
model_id (str, required) – UUID for a specific model object. Refers to a model that was previously trained
client_id (string, required) – Short name for account being used. Used for reporting purposes only
idx_var (string, required) – Name of index field identifying unique record IDs in df for audit purposes
categorical_vars (string[], required) – Array of field names identifying categorical fields in the dataframe df
numerical_vars (string[], required) – Array of field names identifying numerical fields in the dataframe df
bool_vars (string[], optional) – Array of field names identifying boolean fields in the dataframe df
buffer_batch_size (int, optional) – Batch size for the purpose of uploading data from the client to the server’s buffer
api_batch_size (int, optional) – Batch size for the purpose of processing data in the API
verbose (boolean, optional) – Set to true for verbose output
timeout (int, optional) – Client will keep polling API for a period of timeout seconds
step (int, optional) – Polling interval, in seconds
compressed (boolean, optional) – Perform additional compression when uploading data to buffer
staging (boolean, optional) – When set to True the API will use temporary secure cloud storage to buffer the data rather than a relational database (default is True)
encoding (boolean, optional) – Encode and decode data with homomorphic encryption

Returns:

JSON object with the following attributes: model_id (UUID provided with initial request), data2 (original dataset with predicted propensities appended)

train(df, client_id=None, idx_var=None, outcome_var=None, categorical_vars=[], numerical_vars=[], bool_vars=[], algorithm='random-forest-classifier', train_size=0.5, buffer_batch_size=1000, verbose=False, timeout=600, step=2, poll=True, smote=False, param_grid=None, scoring=None, n_splits=None, compressed=False, staging=True, encoding=True)

Train propensity model on user-provided dataset

Parameters:

df (DataFrame, required) – Dataframe containing dataset to be used for training. The data is homomorphically encrypted by the client prior to being transferred to the API buffer when encoding is set to True
client_id (string, required) – Short name for account being used. Used for reporting purposes only
idx_var (string, required) – Name of index field identifying unique record IDs in df for audit purposes
outcome_var (string, required) – Name of dependent variable, usually a boolean variable set to 0 or 1
categorical_vars (string[], required) – Array of field names identifying categorical fields in the dataframe df
numerical_vars (string[], required) – Array of field names identifying numerical fields in the dataframe df
bool_vars (string[], optional) – Array of field names identifying boolean fields in the dataframe df
algorithm (string, required) – Can be any of the following: random-forest-classifier, gradient-boosting-classifier, xgboost-classifier, ada-boost-classifier, extra-trees-classifier, logistic-regression-classifier. Algorithms are sourced from Scikit-Learn unless otherwise indicated. Additional algorithms may be available
train_size (float, optional) – Share of the dataset assigned to training rather than testing. For example, if train_size is set to 0.8, 80% of the dataset will be assigned to training and 20% will be randomly set aside for testing and validation
buffer_batch_size (int, optional) – Batch size for the purpose of uploading data from the client to the server’s buffer
verbose (boolean, optional) – Set to true for verbose output
timeout (int, optional) – Client will keep polling API for a period of timeout seconds
step (int, optional) – Polling interval, in seconds
poll (boolean, optional) – Keep polling API while the job is being run (default is True)
smote (boolean, optional) – Apply SMOTE pre-processing
param_grid (JSON object, optional) – Parameter grid to be used during the cross-validation grid search (hypertuning). The default is algorithm-specific and set by the API.
scoring (string, optional) – Scoring methodology to evaluate the performance of the cross-validated model. Common methodologies include roc_auc, accuracy, and f1. Default is algorithm-specific and set by the API
n_splits (int, optional) – Number of folds. Must be at least 2, defaults to 10
compressed (boolean, optional) – Perform additional compression when uploading data to buffer. Defaults to False
staging (boolean, optional) – When set to True the API will use temporary secure cloud storage to buffer the data rather than a relational database. Defaults to True
encoding (boolean, optional) – encode and decode data with homomorphic encryption. Defaults to True

Returns:

JSON object with the following attributes, as applicable: model_id (UUID provided with initial request), features (table of feature importances), confusion_matrix (confusion matrix using test dataset), stats (error stats including accuracy, precision, recall, F1, AUC, Gini), roc (receiver operating characteristic curve)
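
As an illustration of param_grid and n_splits, the sketch below builds a small grid, expands it into candidate settings, and counts how many model fits a cross-validated grid search would imply. The keys n_estimators and max_depth are Scikit-Learn random forest hyperparameters; whether the API accepts exactly these keys is an assumption, and the helper names grid_candidates and total_fits are hypothetical.

```python
import json
from itertools import product

# Example grid only; the API's algorithm-specific default is not shown here.
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [4, 8, None],
}


def grid_candidates(grid):
    """Expand a parameter grid into one dict per combination."""
    keys = sorted(grid)
    return [
        dict(zip(keys, values))
        for values in product(*(grid[key] for key in keys))
    ]


def total_fits(grid, n_splits=10):
    """Candidates times folds: the number of fits a grid search performs."""
    if n_splits < 2:
        raise ValueError("n_splits must be at least 2")
    return len(grid_candidates(grid)) * n_splits


if __name__ == "__main__":
    print(json.dumps(param_grid))  # the grid serialises as a JSON object
    print(len(grid_candidates(param_grid)), total_fits(param_grid, n_splits=10))
```

Because the number of fits grows multiplicatively with each added parameter, a compact grid keeps training time manageable.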

analyzrclient.runner_regression module

class analyzrclient.runner_regression.RegressionRunner(client=None, base_url=None)

Bases: BaseRunner

Run the regression pipeline

Parameters:

client (SamlSsoAuthClient, required) – SAML SSO client object
base_url (str, required) – Base URL for the Analyzr API tenant

check_status(model_id=None, client_id=None, verbose=False)

Check the status of a specific model run, and retrieve results if model run is complete. Data is homomorphically encoded by default

Parameters:

model_id (str, required) – UUID for a specific model object
client_id (string, required) – Short name for account being used. Used for reporting purposes only
verbose (boolean, optional) – Set to true for verbose output

Returns:

JSON object with the following attributes, as applicable: status (can be Pending, Complete, or Failed), features (table of feature importances), stats (error stats including R2, p, RMSE, MSE, MAE, MAPE)
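
When a job is started with poll set to False, the status can be checked later. The sketch below shows one way to poll check_status until the run is Complete or Failed, using the same timeout and step semantics described for train and predict. It assumes the returned object behaves like a dict with a status key; the helper name wait_for_model and its sleep and clock parameters are hypothetical and exist only to make the loop easy to test.

```python
import time


def wait_for_model(runner, model_id, client_id, timeout=600, step=2,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll runner.check_status until the run finishes or timeout elapses.

    Assumes check_status returns a dict-like object with a "status" key
    set to "Pending", "Complete", or "Failed".
    """
    deadline = clock() + timeout
    while True:
        result = runner.check_status(model_id=model_id, client_id=client_id)
        if result.get("status") in ("Complete", "Failed"):
            return result
        if clock() >= deadline:
            raise TimeoutError(
                f"model {model_id} still pending after {timeout} seconds"
            )
        sleep(step)  # wait `step` seconds before the next status check
```

A caller would pass an authenticated runner and the model_id returned by train, then inspect the status of the returned object before reading the results.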

predict(df, model_id=None, client_id=None, idx_var=None, categorical_vars=[], numerical_vars=[], bool_vars=[], buffer_batch_size=1000, verbose=False, timeout=600, step=2, compressed=False, staging=True)

Predict outcomes for user-specified dataset using a pre-trained model. The data is homomorphically encrypted by the client prior to being transferred to the API buffer by default

Parameters:

df (DataFrame, required) – dataframe containing dataset to be scored.
model_id (str, required) – UUID for a specific model object. Refers to a model that was previously trained
client_id (string, required) – Short name for account being used. Used for reporting purposes only
idx_var (string, required) – name of index field identifying unique record IDs in df for audit purposes
categorical_vars (string[], required) – array of field names identifying categorical fields in the dataframe df
numerical_vars (string[], required) – array of field names identifying numerical fields in the dataframe df
bool_vars (string[], optional) – array of field names identifying boolean fields in the dataframe df
buffer_batch_size (int, optional) – batch size for the purpose of uploading data from the client to the server’s buffer
verbose (boolean, optional) – Set to true for verbose output
timeout (int, optional) – client will keep polling API for a period of timeout seconds
step (int, optional) – polling interval, in seconds
compressed (boolean, optional) – perform additional compression when uploading data to buffer
staging (boolean, optional) – when set to True the API will use temporary secure cloud storage to buffer the data rather than a relational database (default is True)

Returns:

JSON object with the following attributes: model_id (UUID provided with initial request), data2 (original dataset with predicted outcomes appended)
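
Because idx_var, categorical_vars, numerical_vars, and bool_vars all refer to columns of df, a quick client-side check before calling predict or train can catch misspelt or duplicated field names early. The sketch below is a hypothetical helper, not part of analyzrclient, and it takes a plain list of column names (for a dataframe, its column labels). Whether the API itself rejects a field listed under two types is not stated in this reference.

```python
def check_field_lists(columns, idx_var, categorical_vars=(),
                      numerical_vars=(), bool_vars=()):
    """Return a list of human-readable problems (empty when all is well)."""
    known = set(columns)
    problems = []
    if idx_var not in known:
        problems.append(f"idx_var {idx_var!r} is not a column")
    groups = {
        "categorical_vars": categorical_vars,
        "numerical_vars": numerical_vars,
        "bool_vars": bool_vars,
    }
    seen = {}  # field name -> first list it appeared in
    for group, names in groups.items():
        for name in names:
            if name not in known:
                problems.append(f"{group}: {name!r} is not a column")
            if name in seen:
                problems.append(
                    f"{name!r} appears in both {seen[name]} and {group}"
                )
            else:
                seen[name] = group
    return problems


if __name__ == "__main__":
    cols = ["id", "plan", "age", "active"]  # made-up column names
    print(check_field_lists(cols, "id", ["plan"], ["plan", "income"]))
```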

train(df, client_id=None, idx_var=None, outcome_var=None, categorical_vars=[], numerical_vars=[], bool_vars=[], algorithm='random-forest-regression', train_size=0.5, buffer_batch_size=1000, param_grid=None, verbose=False, timeout=600, poll=True, step=2, compressed=False, staging=True)

Train regression model on user-provided dataset

Parameters:

df (DataFrame, required) – dataframe containing dataset to be used for training. The data is homomorphically encrypted by the client prior to being transferred to the API by default
client_id (string, required) – Short name for account being used. Used for reporting purposes only
idx_var (string, required) – name of index field identifying unique record IDs in df for audit purposes
outcome_var (string, required) – name of dependent variable, usually a numerical variable
categorical_vars (string[], required) – array of field names identifying categorical fields in the dataframe df
numerical_vars (string[], required) – array of field names identifying numerical fields in the dataframe df
bool_vars (string[], optional) – array of field names identifying boolean fields in the dataframe df
algorithm (string, required) – can be any of the following: random-forest-regression, gradient-boosting-regression, xgboost-regression, linear-regression-classifier. Algorithms are sourced from Scikit-Learn unless otherwise indicated. Additional algorithms may be available
train_size (float, optional) – Share of the dataset assigned to training rather than testing. For example, if train_size is set to 0.8, 80% of the dataset will be assigned to training and 20% will be randomly set aside for testing and validation
buffer_batch_size (int, optional) – batch size for the purpose of uploading data from the client to the server’s buffer
param_grid (JSON object, optional) – Parameter grid to be used during the cross-validation grid search (hypertuning). The default is algorithm-specific and set by the API.
verbose (boolean, optional) – Set to true for verbose output
timeout (int, optional) – client will keep polling API for a period of timeout seconds
poll (boolean, optional) – keep polling API while the job is being run (default is True)
step (int, optional) – polling interval, in seconds
compressed (boolean, optional) – perform additional compression when uploading data to buffer
staging (boolean, optional) – when set to True the API will use temporary secure cloud storage to buffer the data rather than a relational database (default is True)

Returns:

JSON object with the following attributes, as applicable: model_id (UUID provided with initial request), features (table of feature importances), stats (error stats including R2, p, RMSE, MSE, MAE, MAPE)
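
The effect of train_size can be illustrated with a small random split of row indices. This is a sketch of the general idea only: the function name split_indices and its seed parameter are hypothetical, and the way the API rounds, shuffles, or stratifies the split is not documented here.

```python
import random


def split_indices(n_rows, train_size=0.5, seed=0):
    """Randomly assign row indices to a training and a testing set."""
    if not 0 < train_size < 1:
        raise ValueError("train_size must be strictly between 0 and 1")
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_train = int(round(n_rows * train_size))
    return indices[:n_train], indices[n_train:]


if __name__ == "__main__":
    train_rows, test_rows = split_indices(1000, train_size=0.8)
    print(len(train_rows), len(test_rows))
```

With 1,000 rows and train_size set to 0.8, the split yields 800 training rows and 200 testing rows.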

analyzrclient.runner_psm module

class analyzrclient.runner_psm.PropensityScoreMatchingRunner(client=None, base_url=None)

Bases: BaseRunner

Run the propensity score matching pipeline

Parameters:

client (SamlSsoAuthClient, required) – SAML SSO client object
base_url (str, required) – Base URL for the Analyzr API tenant

check_status(model_id=None, client_id=None, verbose=False, encoding=True)

Check the status of a specific model run, and retrieve results if model run is complete

Parameters:

model_id (str, required) – UUID for a specific model object
client_id (string, required) – Short name for account being used. Used for reporting purposes only
verbose (boolean, optional) – Set to true for verbose output
encoding (boolean, optional) – Decode results with homomorphic encryption

Returns:

JSON object with the following attributes, as applicable: status (can be Pending, Complete, or Failed), atx (average treatment effects), raw (dataset stats prior to matching), misc (miscellaneous error stats including accuracy, precision, recall, F1, AUC, Gini)

train(df, client_id=None, idx_var=None, outcome_var=None, treatment_var=None, categorical_vars=[], numerical_vars=[], bool_vars=[], buffer_batch_size=1000, verbose=False, timeout=600, step=2, poll=True, compressed=False, staging=True, encoding=True)

Train propensity score matching model on user-provided dataset

Parameters:

df (DataFrame, required) – Dataframe containing dataset to be used for training. The data is homomorphically encrypted by the client prior to being transferred to the API buffer when encoding is set to True
client_id (string, required) – Short name for account being used. Used for reporting purposes only
idx_var (string, required) – Name of index field identifying unique record IDs in df for audit purposes
outcome_var (string, required) – Name of dependent variable, usually a boolean variable set to 0 or 1
treatment_var (string, required) – Name of treatment variable, usually a boolean variable set to 0 or 1
categorical_vars (string[], required) – Array of field names identifying categorical fields in the dataframe df
numerical_vars (string[], required) – Array of field names identifying numerical fields in the dataframe df
bool_vars (string[], optional) – Array of field names identifying boolean fields in the dataframe df
buffer_batch_size (int, optional) – Batch size for the purpose of uploading data from the client to the server’s buffer
verbose (boolean, optional) – Set to true for verbose output
timeout (int, optional) – Client will keep polling API for a period of timeout seconds
step (int, optional) – Polling interval, in seconds
poll (boolean, optional) – Keep polling API while the job is being run (default is True)
compressed (boolean, optional) – Perform additional compression when uploading data to buffer. Defaults to False
staging (boolean, optional) – When set to True the API will use temporary secure cloud storage to buffer the data rather than a relational database. Defaults to True
encoding (boolean, optional) – encode and decode data with homomorphic encryption. Defaults to True

Returns:

JSON object with the following attributes, as applicable: model_id (UUID provided with initial request), atx (average treatment effects), raw (dataset stats prior to matching), misc (miscellaneous error stats including accuracy, precision, recall, F1, AUC, Gini), bins (histogram of matched propensity scores)
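
To give a sense of what an average treatment effect from propensity score matching represents, the sketch below pairs each treated record with the control record whose propensity score is closest (nearest-neighbour matching with replacement) and averages the outcome differences. This is a textbook simplification chosen for illustration, not a description of the matching method used by the API, and the function names are hypothetical. The sample scores and outcomes are made up.

```python
def match_nearest(treated, control):
    """Pair each treated record with the closest-scoring control record.

    Both inputs are lists of (propensity_score, outcome) tuples.
    Controls may be reused, i.e. matching is done with replacement.
    """
    pairs = []
    for score_t, outcome_t in treated:
        _, outcome_c = min(control, key=lambda row: abs(row[0] - score_t))
        pairs.append((outcome_t, outcome_c))
    return pairs


def average_effect_on_treated(pairs):
    """Mean difference between treated and matched control outcomes."""
    return sum(t - c for t, c in pairs) / len(pairs)


if __name__ == "__main__":
    treated = [(0.80, 1), (0.30, 1)]
    control = [(0.75, 0), (0.35, 1), (0.10, 0)]
    pairs = match_nearest(treated, control)
    print(pairs, average_effect_on_treated(pairs))
```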

Module contents

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.