mossspider.estimators.tmle.NetworkTMLE
- class NetworkTMLE(network, exposure, outcome, degree_restrict=None, alpha=0.05, continuous_bound=0.0005, verbose=False)
Implementation of the Targeted Maximum Likelihood Estimator (TMLE) for network dependent data. The following procedure estimates the expected incidence under a treatment plan of interest. For stochastic treatment plans, the expected incidence is obtained through Monte Carlo integration of a subsample of possible treatment allotments that correspond to the plan of interest.
Note
Network-TMLE makes the weak dependence assumption, such that only direct contacts’ treatment can interfere with individual i’s outcome.
- Parameters
network (NetworkX Graph) – NetworkX undirected network without self-loops. Additionally, all variables should be stored as attributes for each node. Targetula extracts the node data from the graph and creates a pandas.DataFrame object from that information. It is important that no nodes have missing data. Currently there is no procedure to handle missing data.
exposure (str) – String indicating the exposure variable of interest.
outcome (str) – String indicating the outcome variable of interest.
degree_restrict (None, list, tuple, optional) – Restriction on the minimum and maximum degree for nodes to be included in the estimand. Must be a list or tuple of length two, where the first value is the lower bound and the second is the upper bound for degree. Values are inclusive. Nodes with a degree below the first value OR above the second value are treated as “background” features, so the intervention does not change their exposure.
alpha (float, optional) – Alpha for confidence interval level. Default is 0.05.
continuous_bound (float, optional) – For continuous outcomes, TMLE needs to bound Y between 0 and 1. However, 0 and 1 themselves cannot be included in these bounded values. This argument sets the bounds for continuous outcomes. The default is 0.0005.
verbose (bool, optional) – Whether to print all intermediary model results for the estimation process. When set to True, the results of each model are printed to the console. The default is False.
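The role of continuous_bound can be illustrated with a minimal sketch. It assumes the usual TMLE approach of rescaling Y by its observed range and then pulling values off exact 0 and 1; the helper bound_outcome is hypothetical and is not part of mossspider, whose internal procedure may differ.

```python
def bound_outcome(values, continuous_bound=0.0005):
    """Rescale values to [0, 1], then keep them strictly inside (0, 1).

    Hypothetical helper for illustration; not the library's own code.
    """
    lower, upper = min(values), max(values)
    if upper == lower:
        raise ValueError("outcome has no variation to rescale")
    # Min-max rescaling onto the unit interval
    scaled = [(v - lower) / (upper - lower) for v in values]
    # Clip so that exact 0 and 1 never occur
    return [min(max(s, continuous_bound), 1 - continuous_bound) for s in scaled]


y = [2.0, 3.5, 5.0, 8.0]
y_bounded = bound_outcome(y)
print(y_bounded)
```

The smallest and largest observations end up at continuous_bound and 1 - continuous_bound rather than at 0 and 1.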
Note
mossspider calculates exposure mapping variables automatically from the input network. These variables are saved as variable-name_map. So for a variable ‘A’, the newly created exposure mapping variable is ‘A_map’.
Note
For directed networks, the direction of influence goes from the target node to the source (i.e., opposite of the arrow direction). If A –> B, then B’s covariates will be part of A’s summary measures.
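The idea behind an exposure mapping variable can be sketched in plain Python. This is an illustration only: it assumes the mapping is a sum over immediate contacts, and the adjacency dictionaries, the toy values, and the helper sum_map are all hypothetical rather than taken from mossspider.

```python
def sum_map(adjacency, values):
    """Sum a variable over each node's listed contacts (hypothetical helper)."""
    return {node: sum(values[c] for c in contacts)
            for node, contacts in adjacency.items()}


# Undirected toy network: every edge is listed from both endpoints
neighbors = {1: [2, 3], 2: [1], 3: [1, 4], 4: [3]}
exposure = {1: 1, 2: 0, 3: 1, 4: 1}
a_map = sum_map(neighbors, exposure)
print(a_map)

# Directed toy network with edges A --> B and B --> C.
# Each source node lists its targets, so the target's covariates
# enter the source's summary measure.
edges = [("A", "B"), ("B", "C")]
targets = {"A": [], "B": [], "C": []}
for source, target in edges:
    targets[source].append(target)
w = {"A": 0, "B": 1, "C": 1}
w_map = sum_map(targets, w)
print(w_map)
```

In the directed case, node A's summary measure picks up B's value while C, which points at no one, has an empty summary.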
Examples
Setting up environment
>>> from mossspider import NetworkTMLE
>>> from mossspider.dgm import uniform_network, generate_observed
Generating a generic network and some data
>>> graph = generate_observed(uniform_network(n=500, degree=[1, 6]))
Estimation with NetworkTMLE (nonparametric summary measure in exposure map model)
>>> tmle = NetworkTMLE(network=graph, exposure='A', outcome='Y')
>>> tmle.exposure_model('W + W_map')
>>> tmle.exposure_map_model('A + W + W_map', distribution=None)
>>> tmle.outcome_model('A + W + A_map + W_map', print_results=False)
>>> tmle.fit(p=0.8, bound=10e5)
>>> tmle.summary()
Estimation with NetworkTMLE (parametric summary measure in exposure map model)
>>> tmle = NetworkTMLE(network=graph, exposure='A', outcome='Y')
>>> tmle.exposure_model('W + W_map')
>>> tmle.exposure_map_model('A + W + W_map', measure='sum', distribution='poisson')
>>> tmle.outcome_model('A + W + A_map + W_map', print_results=False)
>>> tmle.fit(p=0.8, bound=10e5)
>>> tmle.summary()
Estimation with NetworkTMLE and restricting inference by degree
>>> tmle = NetworkTMLE(network=graph, exposure='A', outcome='Y', degree_restrict=[0, 5])
>>> tmle.exposure_model('W + W_map')
>>> tmle.exposure_map_model('A + W + W_map', measure='sum', distribution='poisson')
>>> tmle.outcome_model('A + W + A_map + W_map', print_results=False)
>>> tmle.fit(p=0.8, bound=10e5)
>>> tmle.summary()
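To make the effect of degree_restrict=[0, 5] concrete, the following sketch splits the nodes of a toy network into those inside the inclusive degree range and the "background" nodes outside it. The helper split_by_degree and the star-shaped network are hypothetical and only mirror the rule described for the degree_restrict parameter.

```python
def split_by_degree(adjacency, degree_restrict):
    """Split nodes by an inclusive [lower, upper] degree range (hypothetical helper)."""
    lower, upper = degree_restrict
    degree = {node: len(contacts) for node, contacts in adjacency.items()}
    included = sorted(n for n, d in degree.items() if lower <= d <= upper)
    background = sorted(n for n, d in degree.items() if d < lower or d > upper)
    return included, background


# Star network: node 0 has six contacts, every other node has one
neighbors = {0: [1, 2, 3, 4, 5, 6]}
neighbors.update({leaf: [0] for leaf in range(1, 7)})

included, background = split_by_degree(neighbors, [0, 5])
print(included)
print(background)
```

Here the hub has degree 6, so it falls outside the range and would be treated as background.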
Diagnostic plot for support of policy of interest in observed data
>>> import matplotlib.pyplot as plt
>>> tmle.diagnostics()
>>> plt.show()
Generating a threshold measure based on a summary measure
>>> tmle = NetworkTMLE(network=graph, exposure='A', outcome='Y')
>>> tmle.define_threshold(variable='A_sum', threshold=3)  # A_sum_t3
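A threshold measure is an indicator derived from a summary measure. The sketch below assumes a strict inequality (value greater than the threshold), which is an assumption and may not match the convention mossspider uses; threshold_indicator is a hypothetical helper.

```python
def threshold_indicator(values, threshold):
    """Return 1 where the value exceeds the threshold, else 0.

    Strict inequality is an assumption made for this illustration.
    """
    return [1 if v > threshold else 0 for v in values]


a_sum = [0, 2, 3, 4, 6]            # toy values of a summed exposure map
a_sum_t3 = threshold_indicator(a_sum, threshold=3)
print(a_sum_t3)
```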
Generating a category measure based on a binned summary measure
>>> tmle = NetworkTMLE(network=graph, exposure='A', outcome='Y')
>>> tmle.define_category(variable='A_sum', bins=[0, 1, 2, 4, 6])  # A_sum_c
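The binning above relies on pandas.cut(..., include_lowest=True). As a rough, dependency-free sketch of what that binning does with integer codes (labels=False), the hypothetical helper bin_codes below assigns each value to a right-closed interval, with the first interval also closed on the left. It is an illustration of the interval rule, not the library's code.

```python
from bisect import bisect_left


def bin_codes(values, bins):
    """Integer bin codes for right-closed intervals; first interval includes its left edge."""
    codes = []
    for v in values:
        if v < bins[0] or v > bins[-1]:
            codes.append(None)  # outside all bins (pandas would give NaN)
        else:
            # bisect_left finds the right edge that closes the interval
            codes.append(max(bisect_left(bins, v) - 1, 0))
    return codes


a_sum = [0, 1, 2, 3, 4, 5, 6, 7]
a_sum_c = bin_codes(a_sum, bins=[0, 1, 2, 4, 6])
print(a_sum_c)
```

With bins [0, 1, 2, 4, 6] the intervals are [0, 1], (1, 2], (2, 4], and (4, 6], so 0 and 1 share a code and 7 falls outside every bin.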
References
van der Laan MJ. (2014). Causal inference for a population of causally connected units. Journal of Causal Inference, 2(1), 13-74.
Sofrygin O & van der Laan MJ. (2017). Semi-parametric estimation and inference for the mean outcome of the single time-point intervention in a causally connected population. Journal of Causal Inference, 5(1).
Ogburn EL, Sofrygin O, Diaz I, & van der Laan MJ. (2017). Causal inference for social network data. arXiv preprint arXiv:1705.08527.
Sofrygin O, Ogburn EL, & van der Laan MJ. (2018). Single Time Point Interventions in Network-Dependent Data. In Targeted Learning in Data Science (pp. 373-396). Springer.
- __init__(network, exposure, outcome, degree_restrict=None, alpha=0.05, continuous_bound=0.0005, verbose=False)
Methods
__init__(network, exposure, outcome[, ...])
define_category(variable, bins[, labels]) – Creates a categorical variable from a binned summary measure. Can be called repeatedly to define multiple categorical variables.
define_threshold(variable, threshold) – Creates a threshold variable from a summary measure. Can be called repeatedly to define multiple thresholds.
diagnostics([figsize, color_a1, color_a0]) – Returns diagnostic plot for the specified network-TMLE.
exposure_map_model(model[, measure, ...]) – Exposure summary measure model for individual i.
exposure_model(model[, custom_model, ...]) – Exposure model for individual i.
fit(p[, samples, bound, seed]) – Estimation procedure under a specified treatment plan.
outcome_model(model[, custom_model, ...]) – Estimation of the outcome model E(Y|A, A_map, W, W_map).
summary([decimal]) – Prints summary results for the sample average treatment effect under the treatment plan specified in the fit procedure.
- exposure_model(model, custom_model=None, custom_model_sim=None)
Exposure model for individual i. Estimates Pr(A=a|W, W_map) using a logistic regression model.
Note
This function only saves the model specifications. IPTW are calculated later during the fit() procedure since the policy is needed.
- Parameters
model (str) – Exposure model specification for individual i, for example 'W + W_map'.
custom_model – User-specified model.
custom_model_sim – User-specified model. This allows the user to specify a different IPW model to be fit for the numerator. That model is fit to the simulated data, so some constraints may be added to speed up the estimation procedure. If None and custom_model is not None, the custom_model is copied over.
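Why the weights need the policy can be seen in a simplified sketch of the individual-exposure part of the weight: the probability of the observed exposure under the policy, divided by its probability under the fitted model. This is a simplification under stated assumptions. The library estimates the numerator from a model fit to simulated data, whereas here the known policy probability p stands in for it, and the summary-measure component is ignored. The names expit and individual_ratio are hypothetical.

```python
import math


def expit(x):
    """Inverse logit, mapping a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def individual_ratio(a, linear_predictor, p):
    """Policy probability of the observed exposure over its modeled probability.

    Simplified illustration: the numerator uses the policy probability
    directly instead of a model fit to simulated data.
    """
    pr_treated = expit(linear_predictor)           # modeled Pr(A=1 | W, W_map)
    denominator = pr_treated if a == 1 else 1.0 - pr_treated
    numerator = p if a == 1 else 1.0 - p
    return numerator / denominator


# With a linear predictor of 0 the modeled probability of treatment is 0.5
ratio_treated = individual_ratio(a=1, linear_predictor=0.0, p=0.8)
ratio_untreated = individual_ratio(a=0, linear_predictor=0.0, p=0.8)
print(ratio_treated, ratio_untreated)
```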
- exposure_map_model(model, measure=None, distribution=None, custom_model=None, custom_model_sim=None)
Exposure summary measure model for individual i. Estimates Pr(A_map=a|A=a, W, W_map) using a logistic regression model.
Note
Only saves the model specifications. IPTW are calculated later during the fit() function.
There are several options for the distribution of the summary measure. One option is a non-parametric approach that estimates the probability for each individual contact (works best for uniform distributions). However, this approach may not always be possible to estimate. In that case, a parametric distributional assumption can be used instead. Currently, normal and Poisson distributions are implemented.
- Parameters
model (str) – Exposure mapping model. Ideally would include treatment for individual i
measure (None, str, optional) – Exposure mapping to use for the modeling statement. Options include ‘mean’ and ‘sum’. Default is None, which natively works with the distribution=None option.
distribution (None, str, optional) – Distribution to use for exposure mapping model. Options include: non-parametric (None), Normal (‘normal’), Poisson (‘poisson’).
custom_model (None, optional) – User-specified model.
custom_model_sim – User-specified model. This allows the user to specify a different IPW model to be fit for the numerator. That model is fit to the simulated data, so some constraints may be added to speed up the estimation procedure. If None and custom_model is not None, the custom_model is copied over.
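As a sketch of what a parametric assumption buys, consider measure='sum' with distribution='poisson': the number of treated contacts is treated as a count, and its probability follows from the Poisson mass function once a rate has been predicted. The rate of 1.5 below is a made-up value standing in for a regression prediction, and poisson_pmf is a hypothetical helper, not a mossspider function.

```python
import math


def poisson_pmf(k, rate):
    """Probability of observing count k under a Poisson distribution with the given rate."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


predicted_rate = 1.5      # hypothetical predicted rate for one individual
observed_sum = 2          # observed number of treated contacts
pr_observed = poisson_pmf(observed_sum, predicted_rate)
print(round(pr_observed, 4))
```

The non-parametric option avoids this distributional assumption at the cost of estimating a probability for each individual contact.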
- outcome_model(model, custom_model=None, distribution='normal')
Estimation of the outcome model E(Y|A, A_map, W, W_map).
Note
Estimates the outcome model (g-formula) using the observed data and generates predictions under the observed distribution of the exposure.
- Parameters
model (str) – Specified Q-model
custom_model – User-specified model
distribution (optional, str) – For non-binary outcome variables, the distribution of Y must be specified. Default is ‘normal’.
- fit(p, samples=100, bound=None, seed=None)
Estimation procedure under a specified treatment plan.
This function estimates the IPTW for the treatment plan of interest, performs the targeting step, performs Monte Carlo integration with the targeted model, and calculates confidence intervals. Confidence intervals are obtained from influence curves.
- Parameters
p (float, int, list, set) – Proportion of the population to treat. For conditional treatment plans, a container object of floats. All values must be between 0 and 1.
samples (int) – Number of samples to generate to calculate the numerator for the weights and for the Monte Carlo integration procedure for stochastic treatment plans. For deterministic treatment plans (p==1 or p==0), samples is set to 1 to reduce the computational burden. Deterministic treatment plans do not require the Monte Carlo integration procedure.
bound (None, int, float) – Bounds to truncate calculated weights by…
seed (int, None) – Random seed for the Monte Carlo integration procedure.
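The Monte Carlo integration step for a stochastic plan can be sketched as follows: draw treatment allotments under the plan, recompute each node's summary measure, predict outcomes, and average over draws. This is a conceptual illustration only. The toy network, the function toy_outcome (standing in for the targeted outcome model), and monte_carlo_mean are hypothetical and do not reproduce the library's procedure.

```python
import random


def monte_carlo_mean(adjacency, predict, p, samples=100, seed=None):
    """Average predicted outcome over random treatment allotments with Pr(A=1) = p."""
    rng = random.Random(seed)
    nodes = list(adjacency)
    draw_means = []
    for _ in range(samples):
        # Assign treatment independently to each node with probability p
        a = {n: 1 if rng.random() < p else 0 for n in nodes}
        # Recompute the summed exposure of each node's contacts
        a_map = {n: sum(a[c] for c in adjacency[n]) for n in nodes}
        draw_means.append(sum(predict(a[n], a_map[n]) for n in nodes) / len(nodes))
    return sum(draw_means) / len(draw_means)


def toy_outcome(a, a_map):
    """Hypothetical stand-in for predictions from the targeted outcome model."""
    return 0.1 + 0.2 * a + 0.05 * a_map


neighbors = {1: [2, 3], 2: [1], 3: [1, 4], 4: [3]}

# Deterministic plans need a single draw, since every allotment is identical
all_treated = monte_carlo_mean(neighbors, toy_outcome, p=1.0, samples=1)
none_treated = monte_carlo_mean(neighbors, toy_outcome, p=0.0, samples=1)
# Stochastic plan: average over many draws, reproducible through the seed
stochastic = monte_carlo_mean(neighbors, toy_outcome, p=0.8, samples=500, seed=1)
print(all_treated, none_treated, stochastic)
```

Fixing the seed makes the stochastic estimate reproducible, which is the purpose of the seed argument.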
- summary(decimal=3)
Prints summary results for the sample average treatment effect under the treatment plan specified in the fit procedure.
- Parameters
decimal (int) – Number of decimal places to display
- Returns
- Return type
None
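The alpha argument of the constructor sets the confidence interval level. As a minimal sketch, assuming a Wald-type interval built from a point estimate and a standard error (for instance one derived from the influence curve), the interval could be computed as below. The estimate and standard error are made-up numbers, and wald_interval is a hypothetical helper rather than the library's code.

```python
from statistics import NormalDist


def wald_interval(estimate, std_error, alpha=0.05):
    """Two-sided (1 - alpha) interval: estimate plus or minus z times the standard error."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate - z * std_error, estimate + z * std_error


lower, upper = wald_interval(estimate=0.5, std_error=0.05, alpha=0.05)
print(round(lower, 3), round(upper, 3))
```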
- diagnostics(figsize=(6, 5), color_a1='blue', color_a0='red')
Returns diagnostic plot for the specified network-TMLE. The currently available diagnostic presents plots of the designated summary measure for \(A^s\) (stratified by \(A\)) for the observed data, and the Monte Carlo simulated data. This diagnostic can be used to visually assess whether the designated policy is poorly-supported by the data.
Note
A policy that has little overlap with the observed data is poorly supported by the observed data. Poorly supported policies may not be well estimated, so considering other stochastic policies is recommended.
- Parameters
figsize (list, set, array, optional) – Determine the figure size (dimensions). Passes directly to plt.subplots(...figsize=figsize).
color_a1 (str, optional) – Color for the A=1 group in the figure. Default is blue.
color_a0 (str, optional) – Color for the A=0 group in the figure. Default is red.
- Returns
- Return type
Diagnostic plot for data support of policy.
- define_threshold(variable, threshold)
Creates a threshold variable from a summary measure. Can be called repeatedly to define multiple thresholds.
- Parameters
variable (str) – Variable to generate the threshold measure for.
threshold (int, float) – Threshold to use as the cutpoint.
- define_category(variable, bins, labels=False)
Creates a categorical variable from a binned summary measure. Can be called repeatedly to define multiple categorical variables.
- Parameters
variable (str) – Variable to generate categories for.
bins (list, set, array) – Bin cutpoints to generate the categorical variable for. Uses pandas.cut(..., include_lowest=True) to create the binned variables.
labels (list, set, array) – Specified labels. Custom labels can be given, but it is generally recommended to keep this set to False.