Reference

analyzrclient.analyzer module

class analyzrclient.analyzer.Analyzer(host=None, verbose=False)

Bases: object

Parent class for Analyzr client. This is the class that should be instantiated by the client. For detailed methods, see the appropriate runner class.

Parameters:

host (str, required) – the FQDN for your API tenant

verbose (boolean, optional) – Set to True for verbose output

client_version()

Provide client version info

Returns:

JSON object with client version and other metadata

login(verbose=False)

Log in to Analyzr API

Parameters:

verbose (boolean, optional) – Set to True for verbose screen output

Return type:

None

logout(verbose=False)

Log out of Analyzr API

Parameters:

verbose (boolean, optional) – Set to True for verbose screen output

Return type:

None

version()

Provide API version info

Returns:

JSON object with API version and other metadata

analyzrclient.runner_cluster module

class analyzrclient.runner_cluster.ClusterRunner(client=None, base_url=None)

Bases: BaseRunner

Run the clustering pipeline

Parameters:
  • client (SamlSsoAuthClient, required) – SAML SSO client object

  • base_url (str, required) – Base URL for the Analyzr API tenant

check_status(request_id=None, client_id=None, verbose=False, data=None)

Check the status of a specific model run. Data is homomorphically encoded by default

Parameters:
  • request_id (str, required) – UUID for a specific model object

  • client_id (string, required) – Short name for account being used. Used for reporting purposes only

  • verbose (boolean, optional) – Set to true for verbose output

  • data (DataFrame, optional) – if data is not None, cluster IDs will be appended and stats compiled

Returns:

JSON object with the following attributes: status (can be Pending, Complete, or Failed), request_id (UUID provided with initial request), data (dataframe with clustering results, if applicable)
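The three documented status values can be handled with a small dispatch. The sketch below assumes the returned JSON object behaves like a Python dict with the status and request_id attributes listed above; summarize_status is a hypothetical helper, not part of the client.

```python
def summarize_status(response):
    """Map a check_status-style response to a short message (hypothetical)."""
    status = response.get("status")
    if status == "Complete":
        return "done: results available for request {}".format(
            response.get("request_id")
        )
    if status == "Pending":
        return "still running"
    if status == "Failed":
        return "run failed"
    # Anything else is outside the documented set of values.
    return "unexpected status: {!r}".format(status)


print(summarize_status({"status": "Pending", "request_id": "abc"}))
```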

predict(df, model_id=None, client_id=None, idx_var=None, categorical_vars=[], numerical_vars=[], buffer_batch_size=1000, cluster_batch_size=None, timeout=600, verbose=False, compressed=False, staging=True)

Assign cluster IDs to dataset using a pre-trained model

Parameters:
  • df (DataFrame, required) – dataframe containing dataset to be clustered. The data is homomorphically encrypted by the client prior to being transferred to the API buffer

  • model_id (str, required) – UUID for a specific model object

  • client_id (string, required) – Short name for account being used. Used for reporting purposes only

  • idx_var (string, required) – name of index field identifying unique record IDs in df for audit purposes

  • categorical_vars (string[], required) – array of field names identifying categorical fields in the dataframe df

  • numerical_vars (string[], required) – array of field names identifying numerical fields in the dataframe df

  • buffer_batch_size (int, optional) – batch size for the purpose of uploading data from the client to the server’s buffer

  • cluster_batch_size (int, optional) – batch size for the purpose of clustering the data provided in the dataframe df

  • timeout (int, optional) – client will keep polling API for a period of timeout seconds

  • verbose (boolean, optional) – Set to true for verbose output

  • compressed (boolean, optional) – perform additional compression when uploading data to buffer

  • staging (boolean, optional) – when set to True the API will use temporary secure cloud storage to buffer the data rather than a relational database (default is True)

Returns:

JSON object with the following attributes: model_id (UUID provided with initial request), data2 (original dataset with cluster IDs appended)

run(df, client_id=None, idx_var=None, categorical_vars=[], numerical_vars=[], algorithm='pca-kmeans', n_components=5, buffer_batch_size=1000, cluster_batch_size=None, verbose=False, poll=True, compressed=False, staging=True)

Run clustering algorithm on user-provided dataset

Parameters:
  • df (DataFrame, required) – dataframe containing dataset to be clustered. The data is homomorphically encrypted by the client prior to being transferred to the API buffer

  • client_id (string, required) – Short name for account being used. Used for reporting purposes only

  • idx_var (string, required) – name of index field identifying unique record IDs in df for audit purposes

  • categorical_vars (string[], required) – array of field names identifying categorical fields in the dataframe df

  • numerical_vars (string[], required) – array of field names identifying numerical fields in the dataframe df

  • algorithm (string, required) – can be any of the following: pca-kmeans, incremental-pca-kmeans, pca-kmeans-simple, kmeans, minibatch-kmeans, gaussian-mixture, birch, dbscan, optics, mean-shift, spectral-clustering, hierarchical-agglomerative. Algorithms are sourced from Scikit-Learn unless otherwise indicated.

  • n_components (int, optional) – number of clustering components

  • buffer_batch_size (int, optional) – batch size for the purpose of uploading data from the client to the server’s buffer

  • cluster_batch_size (int, optional) – batch size for the purpose of clustering the data provided in the dataframe df

  • verbose (boolean, optional) – Set to true for verbose output

  • poll (boolean, optional) – keep polling API while the job is being run (default is True)

  • compressed (boolean, optional) – perform additional compression when uploading data to buffer

  • staging (boolean, optional) – when set to True the API will use temporary secure cloud storage to buffer the data rather than a relational database (default is True)

Returns:

JSON object with the following attributes: model_id (UUID provided with initial request), request_id (same as model_id, provided for backward compatibility), data (original dataset with cluster IDs appended), distances (distance matrix showing inter-cluster distances, centroid to centroid), stats (count, frequency, and attribute averages by cluster ID)
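As an illustration of the run() parameters above, the following sketch assembles keyword arguments and rejects algorithm names outside the documented list. build_cluster_run_kwargs, the field names, and the overlap check are assumptions made for this example; the client itself may validate differently or not at all.

```python
# Algorithm names copied from the `algorithm` parameter description above.
CLUSTER_ALGORITHMS = (
    "pca-kmeans", "incremental-pca-kmeans", "pca-kmeans-simple", "kmeans",
    "minibatch-kmeans", "gaussian-mixture", "birch", "dbscan", "optics",
    "mean-shift", "spectral-clustering", "hierarchical-agglomerative",
)


def build_cluster_run_kwargs(client_id, idx_var, categorical_vars,
                             numerical_vars, algorithm="pca-kmeans",
                             n_components=5):
    """Validate and assemble keyword arguments for run() (hypothetical)."""
    if algorithm not in CLUSTER_ALGORITHMS:
        raise ValueError("unknown clustering algorithm: {!r}".format(algorithm))
    # Assumption: a field is listed as categorical or numerical, not both.
    overlap = sorted(set(categorical_vars) & set(numerical_vars))
    if overlap:
        raise ValueError("fields listed as both types: {}".format(overlap))
    return {
        "client_id": client_id,
        "idx_var": idx_var,
        "categorical_vars": list(categorical_vars),
        "numerical_vars": list(numerical_vars),
        "algorithm": algorithm,
        "n_components": n_components,
    }


kwargs = build_cluster_run_kwargs(
    client_id="demo",                      # hypothetical account short name
    idx_var="record_id",                   # hypothetical field names
    categorical_vars=["region"],
    numerical_vars=["revenue", "tenure"],
)
print(kwargs["algorithm"], kwargs["n_components"])
```

The resulting dictionary could then be passed as runner.run(df, **kwargs), where runner is a ClusterRunner and df is the dataframe to be clustered.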

analyzrclient.runner_propensity module

class analyzrclient.runner_propensity.PropensityRunner(client=None, base_url=None)

Bases: BaseRunner

Run the propensity scoring pipeline

Parameters:
  • client (SamlSsoAuthClient, required) – SAML SSO client object

  • base_url (str, required) – Base URL for the Analyzr API tenant

check_status(model_id=None, client_id=None, verbose=False, encoding=True)

Check the status of a specific model run, and retrieve results if model run is complete

Parameters:
  • model_id (str, required) – UUID for a specific model object

  • client_id (string, required) – Short name for account being used. Used for reporting purposes only

  • verbose (boolean, optional) – Set to true for verbose output

  • encoding (boolean, optional) – Decode results with homomorphic encryption

Returns:

JSON object with the following attributes, as applicable: status (can be Pending, Complete, or Failed), features (table of feature importances), confusion_matrix (confusion matrix using test dataset), stats (error stats including accuracy, precision, recall, F1, AUC, Gini), roc (receiver operating characteristic curve)
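For readers unfamiliar with the error stats named above, this sketch computes accuracy, precision, recall, and F1 from the four cells of a binary confusion matrix, and Gini from AUC using the common relation Gini = 2 × AUC − 1. These are the textbook definitions, shown under the assumption that the API uses the same ones; classification_stats is a hypothetical helper and the counts are made up.

```python
def classification_stats(tp, fp, fn, tn, auc):
    """Textbook binary-classification stats (hypothetical helper).

    Assumes tp + fp > 0, tp + fn > 0, and tp > 0 so no division by zero occurs.
    """
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)      # share of predicted positives that are correct
    recall = tp / (tp + fn)         # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "gini": 2 * auc - 1,
    }


stats = classification_stats(tp=40, fp=10, fn=20, tn=30, auc=0.75)
print(stats)
```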

predict(df, model_id=None, client_id=None, idx_var=None, categorical_vars=[], numerical_vars=[], bool_vars=[], buffer_batch_size=1000, api_batch_size=2000, verbose=False, timeout=600, step=2, compressed=False, staging=True, encoding=True)

Predict probabilities of outcome (propensities) for user-specified dataset using a pre-trained model

Parameters:
  • df (DataFrame, required) – Dataframe containing dataset to be scored. The data is homomorphically encrypted by the client prior to being transferred to the API buffer when encoding is set to True

  • model_id (str, required) – UUID for a specific model object. Refers to a model that was previously trained

  • client_id (string, required) – Short name for account being used. Used for reporting purposes only

  • idx_var (string, required) – Name of index field identifying unique record IDs in df for audit purposes

  • categorical_vars (string[], required) – Array of field names identifying categorical fields in the dataframe df

  • numerical_vars (string[], required) – Array of field names identifying numerical fields in the dataframe df

  • bool_vars (string[], optional) – Array of field names identifying boolean fields in the dataframe df

  • buffer_batch_size (int, optional) – Batch size for the purpose of uploading data from the client to the server’s buffer

  • api_batch_size (int, optional) – Batch size for the purpose of processing data in the API

  • verbose (boolean, optional) – Set to true for verbose output

  • timeout (int, optional) – Client will keep polling API for a period of timeout seconds

  • step (int, optional) – Polling interval, in seconds

  • compressed (boolean, optional) – Perform additional compression when uploading data to buffer

  • staging (boolean, optional) – When set to True the API will use temporary secure cloud storage to buffer the data rather than a relational database (default is True)

  • encoding (boolean, optional) – Encode and decode data with homomorphic encryption

Returns:

JSON object with the following attributes: model_id (UUID provided with initial request), data2 (original dataset with predicted propensities appended)
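The timeout and step parameters describe a polling loop: check, wait step seconds, and give up after timeout seconds. The sketch below shows one way such a loop could work. It is not the client's actual implementation; poll_until_done and its check, sleep, and clock arguments are hypothetical, and the response is assumed to be dict-like with a status attribute.

```python
import time


def poll_until_done(check, timeout=600, step=2,
                    sleep=time.sleep, clock=time.monotonic):
    """Call check() every `step` seconds until the status is no longer Pending.

    Raises TimeoutError if `timeout` seconds pass first (hypothetical helper).
    """
    deadline = clock() + timeout
    while True:
        response = check()
        if response.get("status") != "Pending":
            return response          # Complete or Failed
        if clock() >= deadline:
            raise TimeoutError(
                "no final status after {} seconds".format(timeout)
            )
        sleep(step)


# Simulated sequence of responses standing in for repeated status checks.
responses = iter([
    {"status": "Pending"},
    {"status": "Pending"},
    {"status": "Complete"},
])
result = poll_until_done(lambda: next(responses), timeout=10, step=0,
                         sleep=lambda seconds: None)
print(result["status"])
```

In practice, check could be a small function wrapping a check_status call for a given model_id and client_id.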

train(df, client_id=None, idx_var=None, outcome_var=None, categorical_vars=[], numerical_vars=[], bool_vars=[], algorithm='random-forest-classifier', train_size=0.5, buffer_batch_size=1000, verbose=False, timeout=600, step=2, poll=True, smote=False, param_grid=None, scoring=None, n_splits=None, compressed=False, staging=True, encoding=True)

Train propensity model on user-provided dataset

Parameters:
  • df (DataFrame, required) – Dataframe containing dataset to be used for training. The data is homomorphically encrypted by the client prior to being transferred to the API buffer when encoding is set to True

  • client_id (string, required) – Short name for account being used. Used for reporting purposes only

  • idx_var (string, required) – Name of index field identifying unique record IDs in df for audit purposes

  • outcome_var (string, required) – Name of dependent variable, usually a boolean variable set to 0 or 1

  • categorical_vars (string[], required) – Array of field names identifying categorical fields in the dataframe df

  • numerical_vars (string[], required) – Array of field names identifying numerical fields in the dataframe df

  • bool_vars (string[], optional) – Array of field names identifying boolean fields in the dataframe df

  • algorithm (string, required) – Can be any of the following: random-forest-classifier, gradient-boosting-classifier, xgboost-classifier, ada-boost-classifier, extra-trees-classifier, logistic-regression-classifier. Algorithms are sourced from Scikit-Learn unless otherwise indicated. Additional algorithms may be available

  • train_size (float, optional) – Share of the dataset assigned to training vs. testing, e.g. if train_size is set to 0.8, 80% of the dataset will be assigned to training and 20% will be randomly set aside for testing and validation

  • buffer_batch_size (int, optional) – Batch size for the purpose of uploading data from the client to the server’s buffer

  • verbose (boolean, optional) – Set to true for verbose output

  • timeout (int, optional) – Client will keep polling API for a period of timeout seconds

  • step (int, optional) – Polling interval, in seconds

  • poll (boolean, optional) – Keep polling API while the job is being run (default is True)

  • smote (boolean, optional) – Apply SMOTE pre-processing

  • param_grid (JSON object, optional) – Parameter grid to be used during the cross-validation grid search (hypertuning). The default is algorithm-specific and set by the API.

  • scoring (string, optional) – Scoring methodology to evaluate the performance of the cross-validated model. Common methodologies include roc_auc, accuracy, and f1. Default is algorithm-specific and set by the API

  • n_splits (int, optional) – Number of folds. Must be at least 2, defaults to 10

  • compressed (boolean, optional) – Perform additional compression when uploading data to buffer. Defaults to False

  • staging (boolean, optional) – When set to True the API will use temporary secure cloud storage to buffer the data rather than a relational database. Defaults to True

  • encoding (boolean, optional) – encode and decode data with homomorphic encryption. Defaults to True

Returns:

JSON object with the following attributes, as applicable: model_id (UUID provided with initial request), features (table of feature importances), confusion_matrix (confusion matrix using test dataset), stats (error stats including accuracy, precision, recall, F1, AUC, Gini), roc (receiver operating characteristic curve)
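The effect of train_size on row counts can be worked out directly. The sketch below is arithmetic only; how the API rounds fractional counts and which rows it selects are not documented here, so the rounding used is an assumption and split_counts is a hypothetical helper.

```python
def split_counts(n_rows, train_size=0.5):
    """Return (training rows, testing rows) for a given train_size share."""
    if not 0 < train_size < 1:
        raise ValueError("train_size must be strictly between 0 and 1")
    # Assumption: fractional counts are rounded to the nearest integer.
    n_train = int(round(n_rows * train_size))
    return n_train, n_rows - n_train


print(split_counts(1000, train_size=0.8))
print(split_counts(1000))
```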

analyzrclient.runner_regression module

class analyzrclient.runner_regression.RegressionRunner(client=None, base_url=None)

Bases: BaseRunner

Run the regression pipeline

Parameters:
  • client (SamlSsoAuthClient, required) – SAML SSO client object

  • base_url (str, required) – Base URL for the Analyzr API tenant

check_status(model_id=None, client_id=None, verbose=False)

Check the status of a specific model run, and retrieve results if model run is complete. Data is homomorphically encoded by default

Parameters:
  • model_id (str, required) – UUID for a specific model object

  • client_id (string, required) – Short name for account being used. Used for reporting purposes only

  • verbose (boolean, optional) – Set to true for verbose output

Returns:

JSON object with the following attributes, as applicable: status (can be Pending, Complete, or Failed), features (table of feature importances), stats (error stats including R2, p, RMSE, MSE, MAE, MAPE)
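As a reminder of what most of the regression stats listed above measure, the sketch below computes R2, RMSE, MSE, MAE, and MAPE from actual and predicted values using their textbook definitions. It assumes the API follows the same definitions; the p statistic is omitted because it depends on a statistical test not described here. regression_stats is a hypothetical helper and the sample values are made up.

```python
import math


def regression_stats(y_true, y_pred):
    """Textbook regression error stats (hypothetical helper).

    Assumes no actual value is zero (needed for MAPE) and that the
    actual values are not all identical (needed for R2).
    """
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mse = ss_res / n
    return {
        "R2": 1 - ss_res / ss_tot,
        "RMSE": math.sqrt(mse),
        "MSE": mse,
        "MAE": sum(abs(e) for e in errors) / n,
        "MAPE": sum(abs(e / t) for e, t in zip(errors, y_true)) / n,
    }


reg = regression_stats([2.0, 4.0, 6.0, 8.0], [3.0, 4.0, 5.0, 8.0])
print(reg)
```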

predict(df, model_id=None, client_id=None, idx_var=None, categorical_vars=[], numerical_vars=[], bool_vars=[], buffer_batch_size=1000, verbose=False, timeout=600, step=2, compressed=False, staging=True)

Predict outcomes for user-specified dataset using a pre-trained model. The data is homomorphically encrypted by the client prior to being transferred to the API buffer by default

Parameters:
  • df (DataFrame, required) – dataframe containing dataset to be scored.

  • model_id (str, required) – UUID for a specific model object. Refers to a model that was previously trained

  • client_id (string, required) – Short name for account being used. Used for reporting purposes only

  • idx_var (string, required) – name of index field identifying unique record IDs in df for audit purposes

  • categorical_vars (string[], required) – array of field names identifying categorical fields in the dataframe df

  • numerical_vars (string[], required) – array of field names identifying numerical fields in the dataframe df

  • bool_vars (string[], optional) – array of field names identifying boolean fields in the dataframe df

  • buffer_batch_size (int, optional) – batch size for the purpose of uploading data from the client to the server’s buffer

  • verbose (boolean, optional) – Set to true for verbose output

  • timeout (int, optional) – client will keep polling API for a period of timeout seconds

  • step (int, optional) – polling interval, in seconds

  • compressed (boolean, optional) – perform additional compression when uploading data to buffer

  • staging (boolean, optional) – when set to True the API will use temporary secure cloud storage to buffer the data rather than a relational database (default is True)

Returns:

JSON object with the following attributes: model_id (UUID provided with initial request), data2 (original dataset with predicted outcomes appended)

train(df, client_id=None, idx_var=None, outcome_var=None, categorical_vars=[], numerical_vars=[], bool_vars=[], algorithm='random-forest-regression', train_size=0.5, buffer_batch_size=1000, param_grid=None, verbose=False, timeout=600, poll=True, step=2, compressed=False, staging=True)

Train regression model on user-provided dataset

Parameters:
  • df (DataFrame, required) – dataframe containing dataset to be used for training. The data is homomorphically encrypted by the client prior to being transferred to the API by default

  • client_id (string, required) – Short name for account being used. Used for reporting purposes only

  • idx_var (string, required) – name of index field identifying unique record IDs in df for audit purposes

  • outcome_var (string, required) – name of dependent variable, usually a numerical variable

  • categorical_vars (string[], required) – array of field names identifying categorical fields in the dataframe df

  • numerical_vars (string[], required) – array of field names identifying numerical fields in the dataframe df

  • bool_vars (string[], optional) – array of field names identifying boolean fields in the dataframe df

  • algorithm (string, required) – can be any of the following: random-forest-regression, gradient-boosting-regression, xgboost-regression, linear-regression-classifier. Algorithms are sourced from Scikit-Learn unless otherwise indicated. Additional algorithms may be available

  • train_size (float, optional) – Share of the dataset assigned to training vs. testing, e.g. if train_size is set to 0.8, 80% of the dataset will be assigned to training and 20% will be randomly set aside for testing and validation

  • buffer_batch_size (int, optional) – batch size for the purpose of uploading data from the client to the server’s buffer

  • param_grid (JSON object, optional) – Parameter grid to be used during the cross-validation grid search (hypertuning). The default is algorithm-specific and set by the API.

  • verbose (boolean, optional) – Set to true for verbose output

  • timeout (int, optional) – client will keep polling API for a period of timeout seconds

  • poll (boolean, optional) – keep polling API while the job is being run (default is True)

  • step (int, optional) – polling interval, in seconds

  • compressed (boolean, optional) – perform additional compression when uploading data to buffer

  • staging (boolean, optional) – when set to True the API will use temporary secure cloud storage to buffer the data rather than a relational database (default is True)

Returns:

JSON object with the following attributes, as applicable: model_id (UUID provided with initial request), features (table of feature importances), stats (error stats including accuracy, precision, recall, F1, AUC, Gini)

analyzrclient.runner_psm module

class analyzrclient.runner_psm.PropensityScoreMatchingRunner(client=None, base_url=None)

Bases: BaseRunner

Run the propensity score matching pipeline

Parameters:
  • client (SamlSsoAuthClient, required) – SAML SSO client object

  • base_url (str, required) – Base URL for the Analyzr API tenant

check_status(model_id=None, client_id=None, verbose=False, encoding=True)

Check the status of a specific model run, and retrieve results if model run is complete

Parameters:
  • model_id (str, required) – UUID for a specific model object

  • client_id (string, required) – Short name for account being used. Used for reporting purposes only

  • verbose (boolean, optional) – Set to true for verbose output

  • encoding (boolean, optional) – Decode results with homomorphic encryption

Returns:

JSON object with the following attributes, as applicable: status (can be Pending, Complete, or Failed), atx (average treatment effects), raw (dataset stats prior to matching), misc (miscellaneous error stats including accuracy, precision, recall, F1, AUC, Gini)

train(df, client_id=None, idx_var=None, outcome_var=None, treatment_var=None, categorical_vars=[], numerical_vars=[], bool_vars=[], buffer_batch_size=1000, verbose=False, timeout=600, step=2, poll=True, compressed=False, staging=True, encoding=True)

Train propensity score matching model on user-provided dataset

Parameters:
  • df (DataFrame, required) – Dataframe containing dataset to be used for training. The data is homomorphically encrypted by the client prior to being transferred to the API buffer when encoding is set to True

  • client_id (string, required) – Short name for account being used. Used for reporting purposes only

  • idx_var (string, required) – Name of index field identifying unique record IDs in df for audit purposes

  • outcome_var (string, required) – Name of dependent variable, usually a boolean variable set to 0 or 1

  • treatment_var (string, required) – Name of treatment variable, usually a boolean variable set to 0 or 1

  • categorical_vars (string[], required) – Array of field names identifying categorical fields in the dataframe df

  • numerical_vars (string[], required) – Array of field names identifying numerical fields in the dataframe df

  • bool_vars (string[], optional) – Array of field names identifying boolean fields in the dataframe df

  • buffer_batch_size (int, optional) – Batch size for the purpose of uploading data from the client to the server’s buffer

  • verbose (boolean, optional) – Set to true for verbose output

  • timeout (int, optional) – Client will keep polling API for a period of timeout seconds

  • step (int, optional) – Polling interval, in seconds

  • poll (boolean, optional) – Keep polling API while the job is being run (default is True)

  • compressed (boolean, optional) – Perform additional compression when uploading data to buffer. Defaults to False

  • staging (boolean, optional) – When set to True the API will use temporary secure cloud storage to buffer the data rather than a relational database. Defaults to True

  • encoding (boolean, optional) – encode and decode data with homomorphic encryption. Defaults to True

Returns:

JSON object with the following attributes, as applicable: model_id (UUID provided with initial request), atx (average treatment effects), raw (dataset stats prior to matching), misc (miscellaneous error stats including accuracy, precision, recall, F1, AUC, Gini), bins (histogram of matched propensity scores)
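To give a sense of what an average treatment effect measures, the sketch below computes the simplest possible estimate: the difference in mean outcome between treated and untreated records, with outcome_var and treatment_var both coded as 0 or 1. This is a textbook illustration only. It performs no matching and is not the computation behind the atx attribute; difference_in_means is a hypothetical helper and the data is made up.

```python
def difference_in_means(outcomes, treatments):
    """Mean outcome of treated records minus mean outcome of controls."""
    treated = [y for y, t in zip(outcomes, treatments) if t == 1]
    control = [y for y, t in zip(outcomes, treatments) if t == 0]
    if not treated or not control:
        raise ValueError("need at least one treated and one control record")
    return sum(treated) / len(treated) - sum(control) / len(control)


outcomes = [1, 1, 0, 1, 0, 0, 1, 0]      # hypothetical outcome_var values
treatments = [1, 1, 1, 1, 0, 0, 0, 0]    # hypothetical treatment_var values
effect = difference_in_means(outcomes, treatments)
print(effect)
```

Propensity score matching refines this kind of comparison by first pairing treated and control records with similar propensity scores, which is why the raw (pre-matching) stats are reported separately.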

Module contents

Copyright (c) 2020-2023 Go2Market Insights, Inc d/b/a Analyzr. All rights reserved. https://analyzr.ai

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.