<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>David&#39;s blog</title>
<link>https://blog.davidlindelof.com/</link>
<atom:link href="https://blog.davidlindelof.com/index.xml" rel="self" type="application/rss+xml"/>
<description>A blog built with Quarto</description>
<generator>quarto-1.9.37</generator>
<lastBuildDate>Sat, 11 Apr 2026 00:00:00 GMT</lastBuildDate>
<item>
  <title>Post With Code</title>
  <dc:creator>Harlow Malloc</dc:creator>
  <link>https://blog.davidlindelof.com/posts/post-with-code/</link>
  <description><![CDATA[ 






<p>This is a post with executable code.</p>



 ]]></description>
  <category>news</category>
  <category>code</category>
  <category>analysis</category>
  <guid>https://blog.davidlindelof.com/posts/post-with-code/</guid>
  <pubDate>Sat, 11 Apr 2026 00:00:00 GMT</pubDate>
  <media:content url="https://blog.davidlindelof.com/posts/post-with-code/image.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Welcome To My Blog</title>
  <dc:creator>Tristan O&#39;Malley</dc:creator>
  <link>https://blog.davidlindelof.com/posts/welcome/</link>
  <description><![CDATA[ 






<p>This is the first post in a Quarto blog. Welcome!</p>
<p><img src="https://blog.davidlindelof.com/posts/welcome/thumbnail.jpg" class="img-fluid"></p>
<p>Since this post doesn’t specify an explicit <code>image</code>, the first image in the post will be used in the listing page of posts.</p>



 ]]></description>
  <category>news</category>
  <guid>https://blog.davidlindelof.com/posts/welcome/</guid>
  <pubDate>Wed, 08 Apr 2026 00:00:00 GMT</pubDate>
</item>
<item>
  <title>How Are P-values Distributed Under The Null?</title>
  <link>https://blog.davidlindelof.com/posts/2025/01/how-are-p-values-distributed-under-the-null/</link>
  <description><![CDATA[ 






<p>I sometimes use this fun interview question for aspiring data scientists:</p>
<blockquote class="blockquote">
<p>How are p-values distributed assuming the null hypothesis is true?</p>
</blockquote>
<p><img src="https://blog.davidlindelof.com/posts/2025/01/how-are-p-values-distributed-under-the-null/images/DALL·E-2025-01-22-09.19.14-A-cartoon-style-illustration-of-a-data-scientist-struggling-at-a-job-interview.-The-scene-shows-a-person-in-a-shirt-and-tie-sitting-nervously-in-fron.webp" class="img-fluid"></p>
<p>I’ve heard a lot of reasonable answers, including:</p>
<ul>
<li><p>It should be centered towards large values</p></li>
<li><p>It should have almost zero mass below 0.05</p></li>
<li><p>It depends on the model</p></li>
<li><p>It depends on the null hypothesis</p></li>
</ul>
<p>All very reasonable and intuitive answers which I would probably, at some point, have given myself. They’re also all wrong.</p>
<p><strong>The (perhaps surprising) answer is that under <em>any</em> null hypothesis, the p-values are uniformly distributed: <em>all</em> p-values between 0 and 1 are equally likely.</strong></p>
<p>Before we give a formal proof, here’s some intuition. For any significance level <img src="https://latex.codecogs.com/png.latex?%5Calpha">, how often will a statistical test under the null yield a significant result? With probability <img src="https://latex.codecogs.com/png.latex?%5Calpha">, of course, by the definition of the significance level. But for a test to be significant at <img src="https://latex.codecogs.com/png.latex?%5Calpha">, it must be true that the p-value <img src="https://latex.codecogs.com/png.latex?p%20%3C%20%5Calpha">. So we’re saying that <img src="https://latex.codecogs.com/png.latex?p%20%3C%20%5Calpha"> with probability <img src="https://latex.codecogs.com/png.latex?%5Calpha">. Or <img src="https://latex.codecogs.com/png.latex?Pr(p%20%3C%20%5Calpha)%20=%20%5Calpha">, which is exactly the cumulative distribution function of a uniform distribution on [0, 1].</p>
<p>More formally, when we perform a statistical test, we calculate some statistic <img src="https://latex.codecogs.com/png.latex?%5Chat%7BS%7D"> from the data. Under the null, this statistic follows some distribution <img src="https://latex.codecogs.com/png.latex?S">. The statistic <img src="https://latex.codecogs.com/png.latex?%5Chat%7BS%7D"> is associated with a p-value <img src="https://latex.codecogs.com/png.latex?%5Chat%7Bp%7D">, which by definition is the probability that the test statistic is at least as extreme as <img src="https://latex.codecogs.com/png.latex?%5Chat%7BS%7D">: <img src="https://latex.codecogs.com/png.latex?%5Chat%7Bp%7D%20=%20Pr(S%20%3E%20%5Chat%7BS%7D)">. But note also that for the p-value to be smaller than <img src="https://latex.codecogs.com/png.latex?%5Chat%7Bp%7D"> would require that the test statistic be larger than <img src="https://latex.codecogs.com/png.latex?%5Chat%7BS%7D">, so <img src="https://latex.codecogs.com/png.latex?Pr(p%20%3C%20%5Chat%7Bp%7D)%20=%20Pr(S%20%3E%20%5Chat%7BS%7D)">, which we just said is equal to <img src="https://latex.codecogs.com/png.latex?%5Chat%7Bp%7D">. So <img src="https://latex.codecogs.com/png.latex?Pr(p%20%3C%20%5Chat%7Bp%7D)%20=%20%5Chat%7Bp%7D">, which is again the definition of a uniform distribution.</p>
<p>Notice that nowhere did I have to assume anything about <img src="https://latex.codecogs.com/png.latex?S">, the distribution of the test statistic. This result holds no matter which test statistic we use. Let’s see this in action for two common statistical tests.</p>
<section id="the-t-test" class="level2">
<h2 class="anchored" data-anchor-id="the-t-test">The t-test</h2>
<p>The t-test tests for the equality of means between two samples. The null hypothesis states that both samples are drawn from the same (normal) distribution. So, to see how the p-value is distributed, we’ll draw two equal-sized samples from the same distribution, compute the p-value from the t-test, and repeat:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1">one_ttest <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>() {</span>
<span id="cb1-2">  x <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rnorm</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>)</span>
<span id="cb1-3">  y <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rnorm</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>)</span>
<span id="cb1-4">  test <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">t.test</span>(x, y)</span>
<span id="cb1-5">  test<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>p.value</span>
<span id="cb1-6">}</span>
<span id="cb1-7"></span>
<span id="cb1-8">p_values_ttest <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">replicate</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">one_ttest</span>())</span>
<span id="cb1-9"></span>
<span id="cb1-10"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">hist</span>(p_values_ttest)</span></code></pre></div></div>
<p><img src="https://blog.davidlindelof.com/posts/2025/01/how-are-p-values-distributed-under-the-null/images/Pasted-image-20250109093750-1024x474.png" class="img-fluid"></p>
<p>As expected, the p-values are uniformly distributed from 0 to 1. There is no evidence of any accumulation of mass towards higher values, nor is there any evidence that p-values smaller than 0.05 are less likely.</p>
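<p>A histogram is a visual check only. As a rough numerical complement, the sketch below reruns the same simulation and looks at two things: the share of p-values below 0.05, and a Kolmogorov-Smirnov test against the uniform distribution. The seed, the number of replications, and the variable names are arbitrary choices for illustration, not part of the original analysis.</p>

```r
set.seed(42)  # arbitrary seed, only so the run is reproducible

# Same experiment as above: t-tests on two samples drawn from one distribution
p_values <- replicate(2000, t.test(rnorm(100), rnorm(100))$p.value)

# If the p-values are uniform, roughly 5% of them should fall below 0.05
rejection_rate <- mean(p_values < 0.05)

# Kolmogorov-Smirnov test of the simulated p-values against Uniform(0, 1)
ks <- ks.test(p_values, "punif")

print(rejection_rate)
print(ks$p.value)
```

<p>A rejection rate close to 0.05 and an unremarkable Kolmogorov-Smirnov p-value are both what a uniform distribution would produce.</p>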
</section>
<section id="the-binomial-test" class="level2">
<h2 class="anchored" data-anchor-id="the-binomial-test">The binomial test</h2>
<p>The binomial test tests whether an empirical proportion differs from a hypothesized proportion <img src="https://latex.codecogs.com/png.latex?p">. The null hypothesis states that the sample is drawn from a population where the condition of interest happens with probability <img src="https://latex.codecogs.com/png.latex?p">. So we’ll follow the same method as above:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb2-1">one_binomtest <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>() {</span>
<span id="cb2-2">  prob <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.2</span></span>
<span id="cb2-3">  successes <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rbinom</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>, prob)</span>
<span id="cb2-4">  test <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">binom.test</span>(successes, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">p =</span> prob)</span>
<span id="cb2-5">  test<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>p.value</span>
<span id="cb2-6">}</span>
<span id="cb2-7"></span>
<span id="cb2-8">p_values_binomtest <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">replicate</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">one_binomtest</span>())</span>
<span id="cb2-9"></span>
<span id="cb2-10"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">hist</span>(p_values_binomtest)</span></code></pre></div></div>
<p><img src="https://blog.davidlindelof.com/posts/2025/01/how-are-p-values-distributed-under-the-null/images/Pasted-image-20250109094309-1024x474.png" class="img-fluid"></p>
<p>As above, there’s no reason to suspect that the p-values are anything other than uniformly distributed.</p>
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<p>In the t-test and the binomial test we didn’t have to specify any significance level; we just looked at the distribution of the p-value assuming the null hypothesis to be true. We found that, as predicted by theory, the p-values are uniformly distributed between 0 and 1, and that therefore the probability of rejecting the null at a significance level <img src="https://latex.codecogs.com/png.latex?%5Calpha"> is precisely <img src="https://latex.codecogs.com/png.latex?%5Calpha">. All p-values between 0 and 1 are equally likely, no matter what statistical test you use (with some exceptions, such as a discrete test distribution).</p>
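<p>The discrete exception is easy to see with a small sample. With only a handful of possible outcomes, the p-value can take only a handful of values, so the probability of landing at or below a given threshold cannot match that threshold exactly. The sketch below computes this probability exactly for a binomial test; the sample size of 20 is a made-up value chosen to make the effect visible.</p>

```r
n <- 20        # hypothetical small sample size
prob <- 0.2    # null proportion, as in the simulation above
alpha <- 0.05

# p-value of the binomial test for every possible number of successes
outcomes <- 0:n
p_values <- sapply(outcomes, function(k) binom.test(k, n, p = prob)$p.value)

# Exact probability, under the null, of a p-value at or below alpha
rejection_rate <- sum(dbinom(outcomes, n, prob)[p_values <= alpha])

print(rejection_rate)
```

<p>The result does not exceed 0.05 and will generally fall short of it: the exact test is conservative when the statistic is discrete. With 1000 trials, as in the simulation above, the steps are small enough that the histogram still looks uniform.</p>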
</section>
<section id="addendum" class="level2">
<h2 class="anchored" data-anchor-id="addendum">Addendum</h2>
<p>I’ve posted a <a href="https://youtu.be/foCQAsMK7vk">short YouTube video</a> illustrating these examples.</p>


</section>

 ]]></description>
  <category>r</category>
  <guid>https://blog.davidlindelof.com/posts/2025/01/how-are-p-values-distributed-under-the-null/</guid>
  <pubDate>Wed, 22 Jan 2025 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Your Classifier Is Broken, But It Is Still Useful</title>
  <link>https://blog.davidlindelof.com/posts/2025/01/estimating-the-true-prevalence-from-a-biased-classifier/</link>
  <description><![CDATA[ 






<p>When you run a binary classifier over a population you get an estimate of the proportion of true positives in that population. This is known as the <em>prevalence</em>.</p>
<p>But that estimate is <em>biased</em>, because no classifier is perfect. For example, if your classifier tells you that 20% of cases are positive, but its precision is known to be only 50%, you would expect the true prevalence to be <img src="https://latex.codecogs.com/png.latex?0.2%20%5Ctimes%200.5%20=%200.1">, i.e.&nbsp;10%. But that’s assuming perfect recall (all true positives are flagged by the classifier). If the recall is less than 1, then you know the classifier missed some true positives, so you <em>also</em> need to normalize the prevalence estimate by the recall.</p>
<p>This leads to the common formula for getting the true prevalence <img src="https://latex.codecogs.com/png.latex?%5CPr(y=1)"> from the positive prediction rate <img src="https://latex.codecogs.com/png.latex?%5CPr(%5Chat%7By%7D=1)">:</p>
<p><img src="https://latex.codecogs.com/png.latex?%20%20%0A%5CPr(y=1)%20=%20%5CPr(%5Chat%7By%7D=1)%20%5Ctimes%20%5Cfrac%7BPrecision%7D%7BRecall%7D%20%20%0A"></p>
<p>But suppose that you want to run the classifier more than once. For example, you might want to do this at regular intervals to detect trends in the prevalence. You can’t use this formula anymore, because <em>precision depends on the prevalence</em>. To use the formula above you would have to re-estimate the precision regularly (say, with human eval), but <a href="https://stats.stackexchange.com/questions/273237/estimating-prevalence-from-a-classifiers-precision-and-recall">then you could just as well also re-estimate the prevalence itself</a>.</p>
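<p>To see why precision moves with the prevalence, it helps to write it out with Bayes’ rule in terms of recall, the false positive rate (one minus the specificity, i.e. one minus the true negative rate), and the prevalence. The sketch below does this for a hypothetical classifier whose recall and specificity are both fixed at 0.9; the function name and the numbers are illustrative assumptions.</p>

```r
# Precision from recall, specificity and prevalence, by Bayes' rule:
# true positives / (true positives + false positives)
precision <- function(recall, specificity, prevalence) {
  true_pos <- recall * prevalence
  false_pos <- (1 - specificity) * (1 - prevalence)
  true_pos / (true_pos + false_pos)
}

# Same hypothetical classifier, three different populations
prevalences <- c(0.01, 0.1, 0.5)
precisions <- precision(recall = 0.9, specificity = 0.9, prevalence = prevalences)

print(round(precisions, 3))
```

<p>The classifier itself never changes, yet its precision goes from about 8% at a prevalence of 1% to 90% at a prevalence of 50%. A precision measured once is therefore only valid for the population it was measured on.</p>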
<p>How do we get out of this circular reasoning? It turns out that binary classifiers have other performance metrics (besides precision) that do not depend on the prevalence. These include not only the recall <img src="https://latex.codecogs.com/png.latex?R"> but also the specificity <img src="https://latex.codecogs.com/png.latex?S">, and these metrics can be used to adjust <img src="https://latex.codecogs.com/png.latex?%5CPr(%5Chat%7By%7D=1)"> to get an unbiased estimate of the true prevalence using this formula (sometimes called <em>prevalence adjustment</em>):</p>
<p><img src="https://latex.codecogs.com/png.latex?%5CPr(y=1)%20=%20%5Cfrac%7B%5CPr(%5Chat%7By%7D=1)%20-%20(1%20-%20S)%7D%7BR%20-%20(1%20-%20S)%7D"><br>
where:</p>
<ul>
<li><p><img src="https://latex.codecogs.com/png.latex?%5CPr(y=1)"> is the true prevalence</p></li>
<li><p><img src="https://latex.codecogs.com/png.latex?S"> is the specificity</p></li>
<li><p><img src="https://latex.codecogs.com/png.latex?R"> is the sensitivity or recall</p></li>
<li><p><img src="https://latex.codecogs.com/png.latex?%5CPr(%5Chat%7By%7D=1)"> is the proportion of predicted positives</p></li>
</ul>
<p>The proof is straightforward:</p>
<p><img src="https://latex.codecogs.com/png.latex?%20%20%0A%5Cbegin%7Baligned%7D%20%20%0A%5CPr(%5Chat%7By%7D=1)%20&amp;=%20%5CPr(%5Chat%7By%7D=1,%20y%20=%201)%20+%20%5CPr(%5Chat%7By%7D=1,%20y%20=%200)%20%5C%5C%20%20%0A&amp;=%20%5CPr(%5Chat%7By%7D=1%20%7C%20y%20=%201)%20%5Ctimes%20%5CPr(y%20=%201)%20+%20%5CPr(%5Chat%7By%7D=1%20%7C%20y%20=%200)%20%5Ctimes%20%5CPr(y%20=%200)%20%5C%5C%20%20%0A&amp;=%20R%20%5Ctimes%20%5CPr(y%20=%201)%20+%20(1%20-%20S)%20%5Ctimes%20(1%20-%20%5CPr(y%20=%201))%20%20%0A%5Cend%7Baligned%7D%20%20%0A"><br>
Solving for <img src="https://latex.codecogs.com/png.latex?%5CPr(y%20=%201)"> yields the formula above.</p>
<p>Notice that this formula breaks down when the denominator <img src="https://latex.codecogs.com/png.latex?R%20-%20(1%20-%20S)"> becomes 0, that is, when recall equals the false positive rate <img src="https://latex.codecogs.com/png.latex?1-S">. But remember what a typical ROC curve looks like:</p>
<p><img src="https://blog.davidlindelof.com/posts/2025/01/estimating-the-true-prevalence-from-a-biased-classifier/images/Pasted-image-20241216172508.png" class="img-fluid"></p>
<p>An ROC curve like this one plots recall <img src="https://latex.codecogs.com/png.latex?R"> (aka true positive rate) against the false positive rate <img src="https://latex.codecogs.com/png.latex?1-S">, so a classifier for which <img src="https://latex.codecogs.com/png.latex?R%20=%20(1-S)"> is a classifier falling on the diagonal of the ROC diagram. This is a classifier that is, essentially, guessing randomly. True cases and false cases are equally likely to be classified positively by this classifier, so the classifier is completely non-informative, and you can’t learn anything from it–and certainly not the true prevalence.</p>
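<p>The adjustment and its failure case fit in a few lines. The sketch below wraps the formula in a hypothetical helper, <code>adjust_prevalence</code>, that refuses to divide when recall equals the false positive rate, and then checks it with a round trip on made-up numbers: compute the positive prediction rate from a known prevalence, then recover that prevalence.</p>

```r
# Adjusted prevalence from the positive prediction rate, recall and specificity
adjust_prevalence <- function(positive_rate, recall, specificity) {
  fpr <- 1 - specificity
  denominator <- recall - fpr
  # On the ROC diagonal the classifier says nothing about the prevalence
  if (abs(denominator) < 1e-12) {
    stop("recall equals the false positive rate: prevalence cannot be recovered")
  }
  (positive_rate - fpr) / denominator
}

# Round trip with made-up numbers
true_prevalence <- 0.3
recall <- 0.8
specificity <- 0.9

# Positive prediction rate implied by the last line of the proof
positive_rate <- recall * true_prevalence + (1 - specificity) * (1 - true_prevalence)

estimate <- adjust_prevalence(positive_rate, recall, specificity)
print(estimate)
```

<p>The estimate matches the prevalence we started from, up to floating-point error. A classifier with, say, recall 0.4 and specificity 0.6 sits on the diagonal and triggers the error instead.</p>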
<p>Enough theory, let’s see if this works in practice:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># randomly draw some covariate</span></span>
<span id="cb1-2">x <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">runif</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10000</span>, <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb1-3"></span>
<span id="cb1-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># take the logit and draw the outcome</span></span>
<span id="cb1-5">logit <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">plogis</span>(x)</span>
<span id="cb1-6">y <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">runif</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10000</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> logit</span>
<span id="cb1-7"></span>
<span id="cb1-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># fit a logistic regression model  </span></span>
<span id="cb1-9">m <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">glm</span>(y <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> x, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">family =</span> binomial)</span>
<span id="cb1-10"></span>
<span id="cb1-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># make some predictions, using an absurdly low threshold</span></span>
<span id="cb1-12">y_hat <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">predict</span>(m, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">type =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"response"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.3</span></span>
<span id="cb1-13"></span>
<span id="cb1-14"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># get the recall (aka sensitivity) and specificity</span></span>
<span id="cb1-15">c <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> caret<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">confusionMatrix</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">factor</span>(y_hat), <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">factor</span>(y), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">positive =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"TRUE"</span>)</span>
<span id="cb1-16">recall <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">unname</span>(c<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>byClass[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Sensitivity'</span>])</span>
<span id="cb1-17">specificity <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">unname</span>(c<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>byClass[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Specificity'</span>])</span>
<span id="cb1-18"></span>
<span id="cb1-19"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># get the adjusted prevalence</span></span>
<span id="cb1-20">(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(y_hat) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> (<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> specificity)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> (recall <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> (<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> specificity))</span>
<span id="cb1-21"></span>
<span id="cb1-22"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># compare with actual prevalence</span></span>
<span id="cb1-23"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(y)</span></code></pre></div></div>
<p>In this simulation I get <code>recall = 0.049</code> and <code>specificity = 0.875</code>. The predicted prevalence is a ridiculously biased <code>0.087</code>, but the adjusted prevalence is essentially equal to the true prevalence (<code>0.498</code>).</p>
<p>To sum up: this shows how, using a classifier’s recall and specificity, you can adjust the predicted prevalence to track it over time, assuming that recall and specificity are stable over time. <em>You cannot do this using precision and recall</em> because precision depends on the prevalence, whereas recall and specificity don’t.</p>



 ]]></description>
  <category>r</category>
  <guid>https://blog.davidlindelof.com/posts/2025/01/estimating-the-true-prevalence-from-a-biased-classifier/</guid>
  <pubDate>Wed, 08 Jan 2025 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Things I wish they taught in school</title>
  <link>https://blog.davidlindelof.com/posts/2024/07/things-i-wish-they-taught-in-school/</link>
  <description><![CDATA[ 






<p>Before he became Spider-Man in the 1960s, Peter Parker was a chemistry and physics genius, an expert photographer, and–get this–<em>even knew how to tie a tie</em>. But he was also a shy, nerdy high-school student who couldn’t have been more than 16-18 years old:</p>
<p><img src="https://blog.davidlindelof.com/posts/2024/07/things-i-wish-they-taught-in-school/images/872c67abe10d4f7b7dbcddca1d8513ed.png" class="img-fluid"></p>
<p>Wait, a high school student going around in a suit and tie? I know this is Marvel but where did they get <em>that</em> idea from?</p>
<p>There seems to be plenty of evidence from comics, books, and movies that youngsters back then were far more mature than today. Could it be that schools used to teach kids fundamental skills earlier than they are taught today? Are there fundamental skills that schools don’t teach anymore? In this post I’ll list a few skills that I think are essential for success in life, yet are hardly taught anymore, not even in college.</p>
<section id="public-speaking" class="level2">
<h2 class="anchored" data-anchor-id="public-speaking">Public speaking</h2>
<p>I can still remember the first time I had to give a public presentation about some lab work I had done with a classmate, something about a hypersensitive magnetometer called <a href="https://en.wikipedia.org/wiki/SQUID">a SQUID</a>. Must have been in my sophomore year. It was a small audience of perhaps 10 teaching assistants, but I hated every minute of it. I basically stood there, reading from a stack of cards.</p>
<p>I can’t remember how I heard about it, but I eventually discovered <a href="https://www.toastmasters.org/">Toastmasters</a>, the international organisation dedicated to promoting leadership and public speaking, and found out that my college had its own local club. I remained a member of Toastmasters for the better part of a decade. I’m not afraid of speaking in public anymore, quite the contrary; I actively seek opportunities to do so, both professionally and privately (I once preached to a community of Christian Thais).</p>
<p>Joining a local Toastmasters club is easily one of the best investments of time you can make for yourself.</p>
</section>
<section id="mind-mapping" class="level2">
<h2 class="anchored" data-anchor-id="mind-mapping">Mind mapping</h2>
<p>I don’t understand why they don’t teach this simple technique already in primary school, but <a href="https://en.wikipedia.org/wiki/Mind_map">mind mapping</a> (the practice of graphically capturing free associations about a given topic) is usually the first thing I do to prepare a presentation, an article, a design document, or even a blog post like this one.</p>
<p>I understand there’s pretty good free software that supports mind-mapping (<a href="https://miro.com/">Miro</a> is one that comes highly recommended), but I learned the technique from one of <a href="https://a.co/d/4sHVcgu">Tony Buzan’s books</a>, with a heavy focus on handwriting. That’s the technique I still prefer, and my notebooks are full of them.</p>
</section>
<section id="mnemonics" class="level2">
<h2 class="anchored" data-anchor-id="mnemonics">Mnemonics</h2>
<p>I don’t know the PIN number of my credit card, but I’ll never, ever be able to forget it. Years ago I taught myself the <a href="https://en.wikipedia.org/wiki/Mnemonic_major_system">Major system</a>, a technique for memorizing numbers, and have used it for phone numbers, PIN numbers, and other short numbers.</p>
<p>It is certainly not the only mnemonic system out there; you have almost certainly heard about the <a href="https://en.wikipedia.org/wiki/Method_of_loci">Memory palace</a> technique, which I occasionally use to memorize grocery lists (although I don’t practice it often enough to call myself proficient).</p>
<p>Yes, I know you can use it to memorize decks of cards and other impressive parlour tricks but I’ve never felt the need to push the skill to that level. But memorizing numbers or lists? Anytime. I also hear that practicing that skill helps develop the ability to concentrate.</p>
</section>
<section id="touch-typing" class="level2">
<h2 class="anchored" data-anchor-id="touch-typing">Touch typing</h2>
<p>I spent almost 10 years learning Emacs. I think most of my graduate thesis was written with Emacs, using a combination of <a href="https://ess.r-project.org/">Emacs extensions</a> for LaTeX and R. I loved every bit of it. I still think it’s one of the best editors out there. Yet, I switched to vi. Why? Because I taught myself touch typing.</p>
<p><a href="https://en.wikipedia.org/wiki/Touch_typing">Touch typing</a> is the ability to type without looking at the keyboard, relying instead on muscle memory to find the positions of the keys. Ever wondered what those small raised bumps on the F and J keys are for? They’re for repositioning your index fingers when you touch type.</p>
<p>What’s that got to do with vi? Well, vi is designed with touch typists in mind. Navigating through a document is done with the H (left), J (down), K (up), and L (right) keys. Similarly, the most common editing operations are done through keys on the home row (D for delete, F for find, S for substitute, etc.).</p>
<p>I have invested in a <a href="https://www.daskeyboard.com/">Das Keyboard</a> with blank keys and work most of the time without looking at the keys. Am I any good at it? I’m not sure. But I’m definitely a faster typist than when I began to learn touch typing, and can transcribe a passage from a book or an article without looking either at the screen or the keyboard. And boy do I love vi now (more on that below).<br>
<img src="https://blog.davidlindelof.com/posts/2024/07/things-i-wish-they-taught-in-school/images/Pasted-image-20230712083814-6-300x300.png" class="img-fluid"><img src="https://blog.davidlindelof.com/posts/2024/07/things-i-wish-they-taught-in-school/images/9e41bf8f44ad61b2c0e9b1d3459954b9.png" class="img-fluid"></p>
</section>
<section id="note-taking" class="level2">
<h2 class="anchored" data-anchor-id="note-taking">Note-taking</h2>
<p>I’ve kept practically all the notes I’ve taken, both privately and professionally, for more than a decade. As much as possible I try to take notes in a single, nice, bound notebook rather than on a pad of disposable paper.</p>
<p>But that’s not what this is about. This is about taking meeting notes. (What follows applies equally well to taking lecture notes.)</p>
<p>Most meetings should have a dedicated note-taker. Meeting notes are important to ensure that there’s a written artifact that captures the information and decisions made during the meeting. But most people take notes by opening a Word or Google Doc and writing a bullet point for every item of information they notice. This is a terrible way to keep meeting notes.</p>
<p>First, ditch the laptop. There’s plenty of evidence that handwriting does something to the brain that increases recall. Second, use the <a href="https://en.wikipedia.org/wiki/Cornell_Notes">Cornell note-taking system</a>. In a nutshell, you draw a vertical line across the page that divides it into two columns, the rightmost of which is about double the width of the leftmost. Don’t draw the line to the bottom but keep a few lines for your summary.</p>
<p><img src="https://blog.davidlindelof.com/posts/2024/07/things-i-wish-they-taught-in-school/images/NotesCornell-1-6.png" class="img-fluid"></p>
<p>In the note-taking column on the right you capture the key ideas from the meeting or the lecture. Use symbols and abbreviations liberally. Keywords or key questions go in the recall column on the left. If necessary, use the summary space at the bottom to capture action items, for example.</p>
<p>It’s the best method I’ve come across for effective note-taking. The Manager Tools people also endorse it and have a <a href="https://www.manager-tools.com/2007/07/how-to-take-notes">podcast episode about it</a>. It is a mystery to me why such a system isn’t taught in schools, where it would be most beneficial.</p>
<p>(If you want to take your note-taking skills to the next level you might want also to look into <a href="https://rohdesign.com/sketchnotes">sketchnotes</a>. Introduced by Mike Rohde in 2007, the idea is to combine drawing with notes. I still use that system in church, capturing sermons that way.)</p>
</section>
<section id="systems-thinking" class="level2">
<h2 class="anchored" data-anchor-id="systems-thinking">Systems thinking</h2>
<p>Schools mainly focus on teaching one specialized topic at a time, but the world is far more complex than a collection of independent ideas. Systems thinking consists in making sense of complex behavior by considering wholes and relationships.</p>
<p>It may sound abstract, possibly esoteric, but you will come across many complex, non-linear systems that cannot be understood by simply breaking them down into their constituent parts. Personally I’ve learned a lot from Gerald Weinberg’s <a href="https://geraldmweinberg.com/Site/General_Systems.html">books about systems thinking</a> and from how you can apply it to reason about a software development process.</p>
</section>
<section id="logical-fallacies-cognitive-biases" class="level2">
<h2 class="anchored" data-anchor-id="logical-fallacies-cognitive-biases">Logical fallacies / cognitive biases</h2>
<p>Recently I witnessed the following exchange of comments debating the pros and cons of a vegan lifestyle.</p>
<blockquote class="blockquote">
<p>- Doesn’t a vegan diet lead to nutritional deficiency, especially of vitamin B12?<br>
- No, because many people on a non-vegan diet also suffer from B12 deficiencies.</p>
</blockquote>
<p>In another video I saw someone giving this impeccable argument on why dairy milk was bad for you:</p>
<blockquote class="blockquote">
<p>Don’t they say that dairy milk is good for you? Sure, but remember that they said the same thing about tobacco smoke.</p>
</blockquote>
<p>I’m not going to settle that debate here, but I wanted to show these examples of a <a href="https://en.wikipedia.org/wiki/Fallacy_of_the_single_cause">fallacy of the single cause</a>: a logical fallacy where it is implicitly assumed that a consequence can have only a single cause. (It is also, possibly, an example of a <a href="https://en.wikipedia.org/wiki/Straw_man">straw man</a>.)</p>
<p>I believe the importance of recognizing such fallacies cannot be overstated, yet I have never come across a classroom where this gets taught.</p>
<p>Wikipedia has a <a href="https://en.wikipedia.org/wiki/List_of_fallacies">great list of fallacies</a>, and also an equally <a href="https://en.wikipedia.org/wiki/List_of_cognitive_biases">great list of cognitive biases</a>. My favorite? The <a href="https://en.wikipedia.org/wiki/Chewbacca_defense">Chewbacca defense</a>, of course.</p>
</section>
<section id="conditional-probabilities" class="level2">
<h2 class="anchored" data-anchor-id="conditional-probabilities">Conditional probabilities</h2>
<p>This is probably a special example of the <a href="https://en.wikipedia.org/wiki/Base_rate_fallacy">base rate fallacy</a> (one of the many logical fallacies mentioned in the previous section) but it shows up often enough that it deserves its own category.</p>
<p>Consider the following example: let’s say you have a diagnostic test for a rare disease. The test has a false positive rate of 0.01% (a specificity of 99.99%), and a false negative rate of 0% (if you have the disease, the test will for sure show positive). The disease’s prevalence is 0.00001%. You get a positive test result. Should you be worried? What’s the probability that you really have the disease?</p>
<p>Most people would say that the probability should be extremely high; probably not as high as 99.9% but not too far off either. I won’t do the math here but the answer, it turns out, is just 1 in a thousand. Not zero exactly, but far from near certainty.</p>
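<p>For readers who do want the math, here is a quick sketch using Bayes’ theorem. The notation is only for illustration: <em>D</em> stands for “has the disease” and <em>T</em> for “tests positive”.</p>
<p><img src="https://latex.codecogs.com/png.latex?P(D%20%5Cmid%20T)%20=%20%5Cfrac%7BP(T%20%5Cmid%20D)%20P(D)%7D%7BP(T%20%5Cmid%20D)%20P(D)%20%2B%20P(T%20%5Cmid%20%5Cneg%20D)%20(1%20-%20P(D))%7D"></p>
<p>Plugging in a prevalence of 10<sup>-7</sup> (0.00001%), a false positive rate of 10<sup>-4</sup> (0.01%), and a sensitivity of 1:</p>
<p><img src="https://latex.codecogs.com/png.latex?P(D%20%5Cmid%20T)%20=%20%5Cfrac%7B1%20%5Ctimes%2010%5E%7B-7%7D%7D%7B1%20%5Ctimes%2010%5E%7B-7%7D%20%2B%2010%5E%7B-4%7D%20%5Ctimes%20(1%20-%2010%5E%7B-7%7D)%7D%20%5Capprox%200.001"></p>
<p>Put differently: among 10 million people, about one has the disease and tests positive, while about a thousand healthy people also test positive.</p>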
</section>
<section id="hand-writing" class="level2">
<h2 class="anchored" data-anchor-id="hand-writing">Handwriting</h2>
<p>I’m going to bet that you think you have terrible handwriting. So do most of us. And that’s not even the fault of the school system. My son, in middle school here in Switzerland, had to learn a certain way of tracing letters and God forbid that he should deviate from the norm. The only problem was that the handwriting system he was taught was not only ugly, it was also slow and impractical.</p>
<p>Some years ago I became interested in improving my own handwriting and came across the <a href="https://handwritingsuccess.com/">Getty-Dubay handwriting system</a>. It’s a self-study guide, a beautiful book (entirely handwritten itself!) that teaches you two handwriting systems, an italic one and a cursive one. Here’s a sample of my handwriting before the course and after:</p>
<p><img src="https://blog.davidlindelof.com/posts/2024/07/things-i-wish-they-taught-in-school/images/2024-07-16-19-21-5-1024x563.jpg" class="img-fluid"></p>
<p><img src="https://blog.davidlindelof.com/posts/2024/07/things-i-wish-they-taught-in-school/images/b3c3fc6be4c930e6c6714c6a37c11210-1024x544.jpg" class="img-fluid"></p>
<p>Most adults I know would definitely benefit from working through this course.</p>
</section>
<section id="bookkeeping" class="level2">
<h2 class="anchored" data-anchor-id="bookkeeping">Bookkeeping</h2>
<p>I don’t mean professional bookkeeping here, where you keep the books for a commercial entity. I mean personal finance, where you track all your expenses and balance a budget.</p>
<p>For years now I’ve been tracking my expenses: first, rather unsuccessfully, with the open-source <a href="https://www.gnucash.org/">GnuCash</a> software, and since 2016 with <a href="https://www.ynab.com/">You Need A Budget</a> (YNAB).</p>
<p>The nice thing about YNAB is that not only will you learn the basics of accounting, especially the double-entry system, but you’ll also adopt their budgeting philosophy where every dollar gets to work. YNAB has a ton of resources and videos explaining these concepts.</p>
</section>
<section id="text-editor" class="level2">
<h2 class="anchored" data-anchor-id="text-editor">Text editor</h2>
<p>If you’re reading this blog, chances are that most of your output flows through your fingers into a computer. And there are two obstacles on the way that prevent you from typing as fast as you think: the keyboard and the text editor.</p>
<p>We’ve dealt with the keyboard earlier; now let’s talk about the editor. There used to be a time when you could spend your entire day in the same text editor: read your email, program, write, and organize your day. The text editor was such a central piece of your daily workflow that learning to use it well was key to increased productivity.</p>
<p>With the advent of advanced IDEs it has become rare to spend one’s entire time in a single text editor; but old habits die hard and most IDEs offer keybindings that mimic most functionalities of those text editors.</p>
<p>Learning and mastering a single text editor remains a valuable skill to have; not only will you have a go-to tool for all text editing work, but odds are that you will be able to transfer those skills when you need to use an IDE or some similar environment.</p>
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<p>That sums it up. This post was way longer than I thought it would be. I’m not suggesting that everything above should be reintroduced in the school curriculum at once, but there’s no reason why educators (or parents, even) couldn’t cover the basics of some of these in an hour or two.</p>


</section>

 ]]></description>
  <guid>https://blog.davidlindelof.com/posts/2024/07/things-i-wish-they-taught-in-school/</guid>
  <pubDate>Mon, 22 Jul 2024 00:00:00 GMT</pubDate>
</item>
<item>
  <title>How to set up a reverse SSH tunnel with Amazon Web Services</title>
  <link>https://blog.davidlindelof.com/posts/2023/05/how-to-set-up-a-reverse-ssh-tunnel-with-amazon-web-services/</link>
  <description><![CDATA[ 






<p>When the startup shut down there were still dozens of netbooks out there in the wild collecting data on the residential houses fitted with our adaptive heating control algorithms, hopelessly attempting to connect to our VPN server that didn’t exist anymore in order to upload all that data to our now-defunct database. That’s a lot of data, sitting and growing on a lot of internet-connected devices.</p>
<p>Some of us came together and figured it could be possible to resume collecting that data, and showcase the benefits of having our system installed on your house. The first problem was, how do we connect to these netbooks? And at near-zero cost?</p>
<p>Warning: hacks ahead.</p>
<p>We figured that step one would be to establish a reverse SSH tunnel to each of these netbooks. A reverse SSH tunnel is set up when an otherwise-inaccessible device (in our case, the netbooks) connects to a publicly available SSH server, opens a port on the server, and forwards (“tunnels”) all incoming connections to that port back to the device. This is the best solution to connect to a device that’s not exposed to the public internet short of setting up a proper VPN solution.</p>
<p><img src="https://blog.davidlindelof.com/posts/2023/05/how-to-set-up-a-reverse-ssh-tunnel-with-amazon-web-services/images/image-5.png" class="img-fluid"></p>
<p>To set up a reverse SSH tunnel you first need a publicly available machine that runs an SSH server and accepts reverse tunnels. The good news is that anyone can have one by <a href="https://aws.amazon.com/free/?trk=f17b4b4e-aa1b-4189-b0c4-81a19b53f625&amp;sc_channel=ps&amp;ef_id=CjwKCAjwscGjBhAXEiwAswQqNPxA3h4EaldAOteOFNJWQtQmuWHHFB-EcdIVMoZIByFZ2rC0nQSm-RoCRpEQAvD_BwE:G:s&amp;s_kwcid=AL!4422!3!645186168166!e!!g!!aws!19579892551!148838343321&amp;all-free-tier.sort-by=item.additionalFields.SortRank&amp;all-free-tier.sort-order=asc&amp;awsf.Free%20Tier%20Types=*all&amp;awsf.Free%20Tier%20Categories=*all">signing up to Amazon Web Services</a> (AWS) and going to the Elastic Compute Cloud (EC2) service:</p>
<p><img src="https://blog.davidlindelof.com/posts/2023/05/how-to-set-up-a-reverse-ssh-tunnel-with-amazon-web-services/images/image-270x300.png" class="img-fluid"></p>
<p>Next you want to launch an instance:</p>
<p><img src="https://blog.davidlindelof.com/posts/2023/05/how-to-set-up-a-reverse-ssh-tunnel-with-amazon-web-services/images/image-1-300x147.png" class="img-fluid"></p>
<p>You really want the smallest, freest possible machine here that runs Linux:</p>
<p><img src="https://blog.davidlindelof.com/posts/2023/05/how-to-set-up-a-reverse-ssh-tunnel-with-amazon-web-services/images/image-2-932x1024.png" class="img-fluid"></p>
<p>Make sure you have generated a key pair for this instance (and that you have saved the private key!) and that the machine accepts SSH from anywhere:</p>
<p><img src="https://blog.davidlindelof.com/posts/2023/05/how-to-set-up-a-reverse-ssh-tunnel-with-amazon-web-services/images/image-3-878x1024.png" class="img-fluid"></p>
<p>When you set up a reverse SSH tunnel you will also need to make sure the EC2 instance accepts traffic on the ports that will be opened by the tunnel. The port numbers are up to you; I created two tunnels, one on port 7030 and one on port 7040. Navigate to the settings for the security group of your instance and make sure the instance accepts TCP traffic on these ports:</p>
<p><img src="https://blog.davidlindelof.com/posts/2023/05/how-to-set-up-a-reverse-ssh-tunnel-with-amazon-web-services/images/image-4-1024x223.png" class="img-fluid"></p>
<p>That’s all on the server side. On the netbook side you need to do three things: 1) get the private key, 2) change the file permissions on the key, 3) establish the tunnel.</p>
<p>Getting the private key to the netbook is entirely up to you. What I did, which is absolutely not recommended, was to place the private key <code>neurobat.pem</code> on the same web server that hosts this blog. Then I was able to get the key with</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb1-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">wget</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--no-check-certificate</span> davidlindelof.com/<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>path-to-key<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span></span></code></pre></div></div>
<p>(Notice the <code>--no-check-certificate</code> argument. Those netbooks are hopelessly out of date and won’t accept HTTPS certificates anymore.)</p>
<p>Next you need to set the right permissions on the key, or SSH will not accept it:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb2-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">chmod</span> 400 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>path-to-key<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span></span></code></pre></div></div>
<p>And finally you can set up the tunnel, say on port 7000:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb3-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ssh</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-i</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>path-to-key<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> -fN <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-R</span> :7000:localhost:22 ec2-user@<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>ec2-ip-address<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span></span></code></pre></div></div>
<p>If all went well you’ll now be able to ssh into the remote device by sshing to your EC2 instance on port 7000:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb4-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ssh</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>username-on-device<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span>@<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>ec2-ip-address<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> -p 7000</span></code></pre></div></div>
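<p>If that last connection is refused or times out, one possible culprit (an assumption; your setup may differ) is the <code>GatewayPorts</code> option of the SSH server on the EC2 instance. By default OpenSSH binds remotely forwarded ports to the loopback interface only, so they are reachable from the instance itself but not from the outside. A minimal sketch of the server-side change, assuming the configuration lives in <code>/etc/ssh/sshd_config</code>:</p>
<pre><code># /etc/ssh/sshd_config on the EC2 instance (path assumed; check your distribution)
# Let the client choose the bind address of forwarded ports,
# so that "-R :7000:localhost:22" listens on all interfaces.
GatewayPorts clientspecified</code></pre>
<p>The SSH daemon has to be restarted for the change to take effect. Keep in mind that this exposes the netbook’s SSH port to anyone who can reach port 7000 on the instance.</p>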
<p>As an extra precaution you might also want to look into using the <a href="https://wiki.gentoo.org/wiki/Autossh">autossh</a> program, which can detect connection drops and attempt to reconnect.</p>
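<p>As a sketch of what that could look like, the tunnel options can be kept in <code>~/.ssh/config</code> on the netbook, together with keepalive settings so that a dead connection gets noticed. The host alias <code>ec2-tunnel</code> and the timing values are illustrative choices, and very old OpenSSH versions may not support every option:</p>
<pre><code># ~/.ssh/config on the netbook ("ec2-tunnel" is a made-up alias)
Host ec2-tunnel
    HostName &lt;ec2-ip-address&gt;
    User ec2-user
    IdentityFile &lt;path-to-key&gt;
    # Same forwarding as "-R :7000:localhost:22"
    RemoteForward *:7000 localhost:22
    # Probe the server every 30 seconds; give up after 3 missed replies
    ServerAliveInterval 30
    ServerAliveCountMax 3
    # Exit instead of lingering if the remote port cannot be opened
    ExitOnForwardFailure yes</code></pre>
<p>With this in place, <code>autossh -M 0 -fN ec2-tunnel</code> should keep the tunnel alive: <code>-M 0</code> turns off autossh’s own monitoring port and relies on the keepalive options above to make <code>ssh</code> exit when the connection drops, at which point autossh restarts it.</p>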
<p>Clunky? Sure. Hacky? You bet. Brittle? Oh my god. But it did the job and I can now work on doing things the “right” way, i.e.&nbsp;setting up a proper VPN solution, probably based on <a href="https://openvpn.net/">OpenVPN</a> or something.</p>



 ]]></description>
  <guid>https://blog.davidlindelof.com/posts/2023/05/how-to-set-up-a-reverse-ssh-tunnel-with-amazon-web-services/</guid>
  <pubDate>Wed, 31 May 2023 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Deep silence or deep work</title>
  <link>https://blog.davidlindelof.com/posts/2023/05/deep-silence-or-deep-work/</link>
  <description><![CDATA[ 






<p>It’s Monday afternoon. It’s a holiday but I have a couple of things to catch up on that I didn’t finish last week. The rest of the family is either at holiday camp or taking a nap in the bedroom. I’m working from home. But the home is anything but silent.</p>
<p><img src="https://blog.davidlindelof.com/posts/2023/05/deep-silence-or-deep-work/images/image-1.png" class="img-fluid"></p>
<p>I can hear the girls’ muffled chatting; from the sound of it they’re making up some story with their dolls. The village church bell just tolled a single note for the quarter past the hour. My phone’s notification just dinged, and in a rare moment of self-discipline I don’t pick it up. Some birds are chirping outside. The convection oven in the kitchen has had a malfunction for years and emits a beep every 10 seconds that I have learned to ignore. Occasionally a plane comes in overhead to land at Geneva’s airport; there’s only one runway and, depending on the direction of the wind, planes come in from the direction of our village. And on top of it all I hear some kind of background whine that’s very soft. I usually don’t notice it but it’s definitely there, and I don’t know if it comes from outside of me or from inside my head.</p>
<p>That’s a lot of noise. These are also the best working conditions I’ve ever experienced. Today I’ve chosen to deliberately notice all these sounds and now I cannot unhear them.</p>
<p>Then there are the visual distractions. I’ve been working for the past three years from a corner in the living room, the rest of which fills my field of view, as well as parts of the kitchen.</p>
<p>These working conditions sound bad but they can be fixed. I usually set a screen between me and the rest of the living room, and almost always do my deep focus work wearing noise-canceling over-the-ear headphones, playing focus-friendly music. My family knows that when daddy wears the headphones, he is not to be disturbed unless there’s blood or fire. It mostly works.</p>
<p>Like many others, I used to work in an open-space office. Noise-wise and visual distraction-wise, open-space offices are possibly better than working from home. On more than one occasion, visitors from abroad have been impressed by the museum-grade silence filling a Swiss open-space office. But open-space offices offer a richer set of options for not concentrating on your deep work. Entire days can go by in interruptions from colleagues, walks to the cafeteria, eavesdropping on neighboring conversations, and attending more meetings than you should for fear of missing out. And the siren song of office perks, of course.</p>
<p>The choice is between perfect quiet filled with distractions, or constant information-free background sounds that you can learn to ignore with monk-like focus. I’ve tried it all and I know what works for me. Do you?</p>



 ]]></description>
  <guid>https://blog.davidlindelof.com/posts/2023/05/deep-silence-or-deep-work/</guid>
  <pubDate>Wed, 17 May 2023 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Is The Ratio of Normal Variables Normal?</title>
  <link>https://blog.davidlindelof.com/posts/2023/05/auto-draft/</link>
  <description><![CDATA[ 






<p>In <em><a href="https://a.co/d/b6Emaxo">Trustworthy Online Controlled Experiments</a></em> I came across this quote, referring to a ratio metric <img src="https://latex.codecogs.com/png.latex?M%20=%20%5Cfrac%7BX%7D%7BY%7D">, which states that:</p>
<blockquote class="blockquote">
<p>Because <img src="https://latex.codecogs.com/png.latex?X"> and <img src="https://latex.codecogs.com/png.latex?Y"> are jointly bivariate normal in the limit, <img src="https://latex.codecogs.com/png.latex?M">, as the ratio of the two averages, is also normally distributed.</p>
</blockquote>
<p>That’s only partially true. According to <a href="https://en.wikipedia.org/wiki/Ratio_distribution">Wikipedia</a>, the ratio of two uncorrelated noncentral normal variables <img src="https://latex.codecogs.com/png.latex?X%20=%20N(%5Cmu_X,%20%5Csigma_X%5E2)"> and <img src="https://latex.codecogs.com/png.latex?Y%20=%20N(%5Cmu_Y,%20%5Csigma_Y%5E2)"> has approximately mean <img src="https://latex.codecogs.com/png.latex?%5Cmu_X%20/%20%5Cmu_Y"> and variance <img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B%5Cmu_X%5E2%7D%7B%5Cmu_Y%5E2%7D%5Cleft(%20%5Cfrac%7B%5Csigma_X%5E2%7D%7B%5Cmu_X%5E2%7D%20+%20%5Cfrac%7B%5Csigma_Y%5E2%7D%7B%5Cmu_Y%5E2%7D%20%5Cright)">. The article implies that this holds when <img src="https://latex.codecogs.com/png.latex?Y"> is unlikely to assume negative values, say <img src="https://latex.codecogs.com/png.latex?%5Cmu_Y%20%3E%203%20%5Csigma_Y">.</p>
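<p>A rough sketch of where that variance comes from, assuming the two variables are uncorrelated and using a first-order Taylor expansion of the ratio around (μ<sub>X</sub>, μ<sub>Y</sub>), often called the delta method. This derivation is not quoted from the Wikipedia article:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cmathrm%7BVar%7D%5Cleft(%5Cfrac%7BX%7D%7BY%7D%5Cright)%20%5Capprox%20%5Cfrac%7B%5Csigma_X%5E2%7D%7B%5Cmu_Y%5E2%7D%20%2B%20%5Cfrac%7B%5Cmu_X%5E2%20%5Csigma_Y%5E2%7D%7B%5Cmu_Y%5E4%7D%20=%20%5Cfrac%7B%5Cmu_X%5E2%7D%7B%5Cmu_Y%5E2%7D%5Cleft(%20%5Cfrac%7B%5Csigma_X%5E2%7D%7B%5Cmu_X%5E2%7D%20%2B%20%5Cfrac%7B%5Csigma_Y%5E2%7D%7B%5Cmu_Y%5E2%7D%20%5Cright)"></p>
<p>The expansion drops all higher-order terms, which is why the approximation is only reasonable when the denominator stays well away from 0.</p>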
<p>As always, the best way to believe something is to see it yourself. Let’s generate some uncorrelated normal variables far from 0 and their ratio:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1">ux <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span></span>
<span id="cb1-2">sdx <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span></span>
<span id="cb1-3">uy <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span></span>
<span id="cb1-4">sdy <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span></span>
<span id="cb1-5"></span>
<span id="cb1-6">X <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rnorm</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">mean =</span> ux, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sd =</span> sdx)</span>
<span id="cb1-7">Y <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rnorm</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">mean =</span> uy, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sd =</span> sdy)</span>
<span id="cb1-8">Z <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> X <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> Y</span></code></pre></div></div>
<p>Their ratio looks normal enough:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb2-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">hist</span>(Z)</span></code></pre></div></div>
<p><img src="https://blog.davidlindelof.com/posts/2023/05/auto-draft/images/image-2-1024x731.png" class="img-fluid"></p>
<p>Which is confirmed by a q-q plot:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb3-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">qqnorm</span>(Z)</span></code></pre></div></div>
<p><img src="https://blog.davidlindelof.com/posts/2023/05/auto-draft/images/image-3-1024x731.png" class="img-fluid"></p>
<p>What about the mean and variance?</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb4-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(Z)</span></code></pre></div></div>
<pre><code>[1] 1.998794</code></pre>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb6-1">ux <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> uy</span></code></pre></div></div>
<pre><code>[1] 2</code></pre>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb8-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">var</span>(Z)</span></code></pre></div></div>
<pre><code>[1] 0.001783404</code></pre>
<pre><code>ux^2 / uy^2 * (sdx^2 / ux^2 + sdy^2 / uy^2)</code></pre>
<pre><code>[1] 0.002</code></pre>
<p>The mean is <em>very</em> close to its theoretical value, and the variance is within about 11% of the approximation.</p>
<p>But what happens now when the denominator <img src="https://latex.codecogs.com/png.latex?Y"> has a mean close to 0?</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb12-1">ux <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span></span>
<span id="cb12-2">sdx <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span></span>
<span id="cb12-3">uy <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span></span>
<span id="cb12-4">sdy <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span></span>
<span id="cb12-5"></span>
<span id="cb12-6">X <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rnorm</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">mean =</span> ux, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sd =</span> sdx)</span>
<span id="cb12-7">Y <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rnorm</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">mean =</span> uy, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sd =</span> sdy)</span>
<span id="cb12-8">Z <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> X <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> Y</span></code></pre></div></div>
<p>Hard to call the resulting ratio normally distributed:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb13" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb13-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">hist</span>(Z)</span></code></pre></div></div>
<p><img src="https://blog.davidlindelof.com/posts/2023/05/auto-draft/images/image-4-1024x731.png" class="img-fluid"></p>
<p>Which is also clear with a q-q plot:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb14" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb14-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">qqnorm</span>(Z)</span></code></pre></div></div>
<p><img src="https://blog.davidlindelof.com/posts/2023/05/auto-draft/images/image-5-1024x731.png" class="img-fluid"></p>
<p>In other words, it is generally true that ratio metrics where the denominator is far from 0 will also be close enough to a normal distribution for practical purposes. But when the denominator’s mean is, say, closer than 5 sigmas from 0 that assumption breaks down.</p>



 ]]></description>
  <category>r</category>
  <guid>https://blog.davidlindelof.com/posts/2023/05/auto-draft/</guid>
  <pubDate>Wed, 03 May 2023 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Working with that data scientist</title>
  <link>https://blog.davidlindelof.com/posts/2023/04/working-with-that-data-scientist/</link>
  <description><![CDATA[ 






<p>In my current team we have decided to split up the work into a number of <em>workstreams</em>, which are in effect subteams responsible for different aspects of the product. One workstream might be responsible for product instrumentation, another for improving the recommendation algorithms, and another for the application’s look and feel. Each workstream has its own backlog and its own set of quarterly commitments, which map nicely to quarterly <a href="https://en.wikipedia.org/wiki/OKR">OKR</a>s.</p>
<p><img src="https://blog.davidlindelof.com/posts/2023/04/working-with-that-data-scientist/images/image.png" class="img-fluid"></p>
<p>Workstreams aren’t necessarily disjoint: the same person might contribute to more than one workstream. Indeed for specialists (UX researchers, UI specialists, data scientists), that is almost the norm. As an aspiring data scientist myself, I contribute to several workstreams; I may entirely own a key result assigned to one workstream, or provide input (e.g.&nbsp;statistical advice, experiment sizing, etc.) to another.</p>
<p>We don’t do <a href="https://en.wikipedia.org/wiki/Stand-up_meeting">daily standups</a>, not even among the software engineers. Instead we meet twice weekly for 30 minutes and review the current plans, update the board, and make sure no one is blocked.</p>
<p>We adopted this process early this year. The response from the team has been generally positive. Compared to a more traditional front-end vs back-end division of labour, the team has cited the following benefits:</p>
<ul>
<li><p>tighter team cohesion</p></li>
<li><p>better understanding of what the others are working on</p></li>
<li><p>more productive team meetings</p></li>
<li><p>greater sense of accomplishment</p></li>
</ul>
<p>The main drawback with this system affects those of us in a more specialized role, such as UI, UX, or Data Science, who contribute to more than one workstream. We find ourselves compelled to attend the semi-weekly meetings of <em>all</em> the workstreams we are involved with, and never know which ones we can safely skip. On top of this I also have a weekly Data Science sync with the product manager.</p>
<p>At a recent retrospective we agreed to mitigate these issues as follows:</p>
<ul>
<li><p>notes should be taken at all meetings, and the note-taker should remember to tag any team member who might be absent but who might need to know something important;</p></li>
<li><p>we will shorten the sync meetings to 15 minutes, and defragment them so that two workstreams can hold their syncs in the same half-hour (and sometimes the same room).</p></li>
</ul>
<p>I can’t say that this is the final, perfect solution for embedding a data scientist in a product team, but at least we have an adaptive process in place: a system to regularly iterate on our processes and give the team permission to adapt their working agreements.</p>
<p>Are you a specialist embedded in a product team mostly made up of software engineers? How do you interact with the rest of the team? I’d love to hear your story in the comments below.</p>



 ]]></description>
  <guid>https://blog.davidlindelof.com/posts/2023/04/working-with-that-data-scientist/</guid>
  <pubDate>Thu, 20 Apr 2023 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Controlling for covariates is not the same as “slicing”</title>
  <link>https://blog.davidlindelof.com/posts/2023/04/controlling-for-covariates-is-not-the-same-as-slicing/</link>
  <description><![CDATA[ 






<p>To detect small effects in experiments you need to reduce the experimental noise as much as possible. You can do it by working with larger sample sizes, but that doesn’t scale well. A far better approach consists in controlling for covariates that are correlated with your response.</p>
<p>I recently gave a talk at our company on the design of online experiments, and someone pointed out that our automated experiment analysis tool implemented “slicing”, that is, running separate analyses on subsets of the data. Wasn’t that the same thing as controlling for covariates?</p>
<p><img src="https://blog.davidlindelof.com/posts/2023/04/controlling-for-covariates-is-not-the-same-as-slicing/images/no-no-no-no-margaret-thatcher.gif" class="img-fluid"></p>
<p>Controlling for covariates means you include them in your statistical model. Running separate analyses means each of your sub-analyses has a smaller sample size; you may gain in precision because your response will be less variable in each subset, but you lose the benefits that come from using a larger sample size.</p>
<p>Let’s illustrate this with a simulation. Let’s say we wish to measure the impact of some treatment, whose effect is about 10 times smaller than the standard deviation of the error term:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1">mu <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># some intercept</span></span>
<span id="cb1-2">err <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># standard deviation of the error</span></span>
<span id="cb1-3">treat_effect <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.1</span>  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># the treatment effect to estimate</span></span></code></pre></div></div>
<p>Let’s say we have a total of 1000 units in each arm of this two-sample experiment, and that they belong to 4 different equal-sized groups labeled&nbsp;<code>A</code>,&nbsp;<code>B</code>,&nbsp;<code>C</code>, and&nbsp;<code>D</code>:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb2-1">n <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span></span>
<span id="cb2-2"></span>
<span id="cb2-3">predictor <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> </span>
<span id="cb2-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(</span>
<span id="cb2-5">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">group =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">gl</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>, n <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>, n <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">labels =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'A'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'B'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'C'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'D'</span>)),</span>
<span id="cb2-6">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">treat =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">gl</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, n, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">labels =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'treat'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'control'</span>)))</span></code></pre></div></div>
<p>Let’s simulate the response. For simplicity, let’s say that the group membership has an impact on the response equal to the treatment effect:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb3-1">group_effect <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> treat_effect</span>
<span id="cb3-2"></span>
<span id="cb3-3">response <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span></span>
<span id="cb3-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">with</span>(</span>
<span id="cb3-5">    predictor,</span>
<span id="cb3-6">    mu <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">as.integer</span>(group) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> group_effect <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> (treat <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'treat'</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> treat_effect <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rnorm</span>(n <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sd =</span> err))</span>
<span id="cb3-7"></span>
<span id="cb3-8">df <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cbind</span>(predictor, response)</span>
<span id="cb3-9"></span>
<span id="cb3-10"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">summary</span>(df)</span></code></pre></div></div>
<pre><code> group       treat         response     
 A:500   treat  :1000   Min.   : 6.368  
 B:500   control:1000   1st Qu.: 9.607  
 C:500                  Median :10.335  
 D:500                  Mean   :10.325  
                        3rd Qu.:11.015  
                        Max.   :13.757  </code></pre>
<p>The following plot shows how the response is distributed in each group. This is one of those instances where you need statistical models to detect effects that are hard to see in a plot:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb5-1">ggplot2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(df, ggplot2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> group, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> response)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb5-2">  ggplot2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_boxplot</span>()</span></code></pre></div></div>
<p><img src="https://blog.davidlindelof.com/posts/2023/04/controlling-for-covariates-is-not-the-same-as-slicing/images/image-1-1024x731.png" class="img-fluid"></p>
<p>Fitting the full model yields the following confidence intervals:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb6-1">mod_full <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lm</span>(response <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> group <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> treat, df)</span>
<span id="cb6-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">confint</span>(mod_full)</span></code></pre></div></div>
<pre><code>                    2.5 %      97.5 %
(Intercept)  10.122391874 10.32334506
groupB        0.009174272  0.26336218
groupC        0.106049411  0.36023732
groupD        0.116794148  0.37098206
treatcontrol -0.193199676 -0.01346168</code></pre>
<p>All coefficients are estimated correctly, and the width of the confidence interval of the treatment effect is about 0.18. The treatment effect is statistically significant. Recall that the error is taken to have a standard deviation of 1, and that <img src="https://latex.codecogs.com/png.latex?n=1000">&nbsp;per arm, so we would expect the width of the 95% confidence interval on the treatment effect to be&nbsp;<img src="https://latex.codecogs.com/png.latex?2%20%5Ctimes%201.96%20%5Ctimes%20%5Csigma%20%5Ctimes%20%5Csqrt%7B2/n%7D">, or about 0.18. We are not very far off.</p>
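<p>As a quick check of that arithmetic, here is a minimal sketch that computes the expected width directly. It assumes the normal approximation (<code>qnorm(0.975)</code> is about 1.96) and repeats the values of <code>n</code> and the error standard deviation used above; the names <code>sigma</code>, <code>se_diff</code> and <code>expected_width</code> are introduced here for illustration only:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">sigma &lt;- 1  # standard deviation of the error term
n &lt;- 1000   # units per arm

# Standard error of the difference between two arm means of n units each
se_diff &lt;- sigma * sqrt(2 / n)

# Full width of the 95% interval under the normal approximation
expected_width &lt;- 2 * qnorm(0.975) * se_diff
round(expected_width, 3)  # about 0.175</code></pre></div>
<p>The interval reported by <code>confint(mod_full)</code> above runs from about -0.193 to -0.013, a width of roughly 0.180.</p>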
<p>What happens now if, instead of controlling for the group, we “sliced” the analysis, i.e.&nbsp;we fit four separate models, one per group? On one hand we will have a smaller error than if we fitted a global model that did not control for the group covariate; on the other hand we will have fewer observations per group, which will hurt our confidence intervals. Let’s check:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb8-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">confint</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lm</span>(response <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> treat, df, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">subset =</span> group <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'A'</span>))</span></code></pre></div></div>
<pre><code>                  2.5 %      97.5 %
(Intercept)  10.0482082 10.30293094
treatcontrol -0.2967421  0.06349024</code></pre>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb10-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">confint</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lm</span>(response <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> treat, df, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">subset =</span> group <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'B'</span>))</span></code></pre></div></div>
<pre><code>                  2.5 %      97.5 %
(Intercept)  10.2324904 10.48601591
treatcontrol -0.3688051 -0.01026586</code></pre>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb12-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">confint</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lm</span>(response <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> treat, df, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">subset =</span> group <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'C'</span>))</span></code></pre></div></div>
<pre><code>                 2.5 %      97.5 %
(Intercept)  10.303706 10.55182387
treatcontrol -0.309723  0.04116836</code></pre>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb14" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb14-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">confint</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lm</span>(response <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> treat, df, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">subset =</span> group <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'D'</span>))</span></code></pre></div></div>
<pre><code>                  2.5 %      97.5 %
(Intercept)  10.4170354 10.66328177
treatcontrol -0.3825046 -0.03425957</code></pre>
<p>The estimates for the treatment effect remain unbiased, but the confidence intervals are now about 0.35 wide, or&nbsp;<img src="https://latex.codecogs.com/png.latex?2%20%5Ctimes%201.96%20%5Ctimes%20%5Csqrt%7B2/(n/4)%7D">, which is what you would expect for a sample size that’s four times smaller. That’s twice as wide as when fitting the whole data with a model that includes the group covariate. In fact, two of the four groups (<code>A</code> and <code>C</code>) now have statistically non-significant results.</p>
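<p>The comparison can also be made in one go. The following is a self-contained sketch that regenerates data with the same structure as above, using an arbitrary seed, so the exact numbers will differ from the outputs shown in this post. The helper <code>ci_width()</code> is hypothetical and simply returns the width of the interval on the <code>treatcontrol</code> coefficient:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">set.seed(2023)  # arbitrary seed; not the one behind the outputs above
n &lt;- 1000

df &lt;- data.frame(
  group = gl(4, n / 4, n * 2, labels = c('A', 'B', 'C', 'D')),
  treat = gl(2, n, labels = c('treat', 'control')))

# Same data-generating process: intercept 10, group and treatment effects 0.1
df$response &lt;- with(
  df,
  10 + as.integer(group) * 0.1 + (treat == 'treat') * 0.1 + rnorm(n * 2, sd = 1))

# Hypothetical helper: width of the 95% interval on the treatment coefficient
ci_width &lt;- function(model) unname(diff(confint(model)['treatcontrol', ]))

# One model that controls for the group covariate
width_full &lt;- ci_width(lm(response ~ group + treat, df))

# Four "sliced" models, one per group
width_sliced &lt;- sapply(levels(df$group), function(g) {
  ci_width(lm(response ~ treat, df, subset = group == g))
})

width_full
width_sliced
width_sliced / width_full  # each ratio should be close to 2</code></pre></div>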
<p>I’m all for automated experiment analysis tools; but when the goal is to detect small effects, I think there’s currently no substitute for a manual analysis by a trained statistician (which I am not). Increasing sample sizes can only take you so far; remember that the width of the confidence intervals shrinks only as&nbsp;<img src="https://latex.codecogs.com/png.latex?1/%5Csqrt%7Bn%7D">, so halving it takes four times as many units. It is almost always better to search for a set of covariates correlated with the response, and include them in your statistical model. And that’s what controlling for a covariate means.</p>



 ]]></description>
  <category>r</category>
  <guid>https://blog.davidlindelof.com/posts/2023/04/controlling-for-covariates-is-not-the-same-as-slicing/</guid>
  <pubDate>Wed, 05 Apr 2023 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Getting into data science</title>
  <link>https://blog.davidlindelof.com/posts/2023/03/getting-into-data-science/</link>
  <description><![CDATA[ 






<p>A while back I had the pleasure of addressing a team of user experience researchers at YouTube, and I was asked for a few resources that could help someone pretty good at science, math, and programming who wanted to get into data science. Here’s the list I gave. These have worked for me in the past, with the caveat that I’m <em>very</em> partial towards books.</p>
<section id="absolute-must-reads" class="level2">
<h2 class="anchored" data-anchor-id="absolute-must-reads"><strong>Absolute must-reads</strong></h2>
<figure class="figure">
<p><img src="https://blog.davidlindelof.com/posts/2023/03/getting-into-data-science/images/image-1.png" class="img-fluid figure-img"></p>
<figcaption>
<p><a href="https://www.statlearning.com/">An Introduction to Statistical Learning</a>&nbsp;</p>
</figcaption>
</figure>
<figure class="figure">
<p><img src="https://blog.davidlindelof.com/posts/2023/03/getting-into-data-science/images/image-2.png" class="img-fluid figure-img"></p>
<figcaption>
<p><a href="https://jakevdp.github.io/PythonDataScienceHandbook/">Python Data Science Handbook</a></p>
</figcaption>
</figure>
<p>Both are freely available, outstanding books that cover a LOT of ground. The former uses R and goes somewhat deeper in theory, while the latter uses Python and is perhaps more practical, covering IPython, NumPy, and the scikit-learn ecosystem.</p>
</section>
<section id="great-too" class="level2">
<h2 class="anchored" data-anchor-id="great-too"><strong>Great too</strong></h2>
<figure class="figure">
<p><img src="https://blog.davidlindelof.com/posts/2023/03/getting-into-data-science/images/image-3.png" class="img-fluid figure-img"></p>
<figcaption>
<p><a href="https://learningstatisticswithr.com/">Learning Statistics with R</a></p>
</figcaption>
</figure>
<p>One of the clearest expositions of fundamental statistical concepts I’ve read. It’s also well written and avoids dry, lifeless prose; the author does a great job at discussing the pros and cons of each technique, and frequently gives templates on how to present the results. One of the most memorable passages was his/her (read the text to understand…) rant against the use of p-values AFTER looking at the data. Free book.</p>
<figure class="figure">
<p><img src="https://blog.davidlindelof.com/posts/2023/03/getting-into-data-science/images/image-4.png" class="img-fluid figure-img"></p>
<figcaption>
<p><a href="https://r4ds.had.co.nz/">R for Data Science</a></p>
</figcaption>
</figure>
<p>Hadley Wickham’s companion book to <a href="https://www.tidyverse.org/">the tidyverse</a>. Essential reading if you’re into R and use the tidyverse. More oriented towards data manipulation and programming than actual statistical modeling. Free book.</p>
</section>
<section id="for-the-brave" class="level2">
<h2 class="anchored" data-anchor-id="for-the-brave"><strong>For the brave</strong></h2>
<figure class="figure">
<p><img src="https://blog.davidlindelof.com/posts/2023/03/getting-into-data-science/images/image-5.png" class="img-fluid figure-img"></p>
<figcaption>
<p><a href="https://web.stanford.edu/~hastie/ElemStatLearn/">The Elements of Statistical Learning</a></p>
</figcaption>
</figure>
<p>The “grown-up” version of ISLR (mentioned above). Covers a lot of theoretical ground, including a great discussion of the bias-variance tradeoff so beloved of interviewers. That book taught me to stop <a href="https://davidlindelof.com/feature-standardization-considered-harmful/">blindly normalizing covariates</a> before running clustering algorithms.</p>
<figure class="figure">
<p><img src="https://blog.davidlindelof.com/posts/2023/03/getting-into-data-science/images/image-6.png" class="img-fluid figure-img"></p>
<figcaption>
<p><a href="https://hbiostat.org/doc/rms.pdf">Regression Modeling Strategies</a></p>
</figcaption>
</figure>
<p>Harrell is to statistics what Wickham is to data manipulation: the opinionated author of some amazing R packages that do a better job than the ones provided in base R. It’s a very dry text though, and probably better read in conjunction with <a href="https://www.nicholas-ollberding.com/post/an-introduction-to-the-harrell-verse-predictive-modeling-using-the-hmisc-and-rms-packages/">some explanatory blog posts</a>. Furthermore, it can be difficult to find resources online because these packages are not as widely adopted as the tidyverse.</p>
</section>
<section id="summer-reading" class="level2">
<h2 class="anchored" data-anchor-id="summer-reading"><strong>Summer reading</strong></h2>
<figure class="figure">
<p><img src="https://blog.davidlindelof.com/posts/2023/03/getting-into-data-science/images/image-7.png" class="img-fluid figure-img"></p>
<figcaption>
<p><a href="https://www.amazon.com/Data-Science-Scratch-Principles-Python/dp/149190142X">Data Science from Scratch</a></p>
</figcaption>
</figure>
<p><a href="https://www.youtube.com/watch?v=7jiPeIFXb6U">Joel Grus is amazing</a>. In this book he shows how to code (and test!) many constructs used in Data Science, culminating with a pseudo-relational database.</p>
</section>
<section id="oh-you-think-you-know-statistics" class="level2">
<h2 class="anchored" data-anchor-id="oh-you-think-you-know-statistics"><strong>Oh you think you know statistics?</strong></h2>
<figure class="figure">
<p><img src="https://blog.davidlindelof.com/posts/2023/03/getting-into-data-science/images/image-8.png" class="img-fluid figure-img"></p>
<figcaption>
<p><a href="https://www.amazon.com/Statistical-Evidence-Likelihood-Monographs-Probability/dp/0412044110">Statistical Evidence</a></p>
</figcaption>
</figure>
<figure class="figure">
<p><img src="https://blog.davidlindelof.com/posts/2023/03/getting-into-data-science/images/image-9.png" class="img-fluid figure-img"></p>
<figcaption>
<p><a href="https://www.amazon.com/Causal-Inference-Statistics-Judea-Pearl/dp/1119186846">Causal Inference in Statistics: A Primer</a></p>
</figcaption>
</figure>
<p>I’m including these two books because I think reading them will make you a better statistician. The former is a short but mind-blowing read that will make you rethink every analysis you’ve ever done. The latter is the must-read text if you’re going to do any kind of causal inference.</p>
</section>
<section id="non-book-resources" class="level2">
<h2 class="anchored" data-anchor-id="non-book-resources"><strong>Non-book resources</strong></h2>
<p><a href="https://www.coursera.org/learn/machine-learning">Machine Learning</a></p>
<p><a href="https://www.coursera.org/specializations/deep-learning">Deep Learning</a></p>
<p><a href="https://www.udacity.com/course/ai-artificial-intelligence-nanodegree--nd898">AI nanodegree</a></p>
<p>These are some online courses I’ve taken and which I can wholeheartedly recommend, especially the first one, which covers most of the concepts used in DS / ML. The Deep Learning specialization is more oriented towards neural networks, while Udacity’s AI nanodegree probably has little to do with DS but is a great intro to topics like building game-playing AI or path-finding algorithms.</p>
<p>Am I missing something? Feel free to add your own recommendations in the comments below.</p>


</section>

 ]]></description>
  <guid>https://blog.davidlindelof.com/posts/2023/03/getting-into-data-science/</guid>
  <pubDate>Wed, 22 Mar 2023 00:00:00 GMT</pubDate>
</item>
<item>
  <title>The law of total probability applied to a conditional probability</title>
  <link>https://blog.davidlindelof.com/posts/2023/03/the-law-of-total-probability-applied-to-a-conditional-probability/</link>
  <description><![CDATA[ 






<p>Dear future self,</p>
<p>I’ve just lost (again) about half an hour of my life trying to find a vaguely remembered formula that generalizes the law of total probability to the case of conditional probabilities. Here it is. You’re welcome.</p>
<figure class="figure">
<p><img src="https://blog.davidlindelof.com/posts/2023/03/the-law-of-total-probability-applied-to-a-conditional-probability/images/conditional_risk.png" class="img-fluid figure-img"></p>
<figcaption>
<p><em>So what is the probability of dying from a lightning strike if you’re an American who knows this statistic?</em></p>
</figcaption>
</figure>
<p>The law of total probability says that if you can decompose the set of possible events into disjoint subsets (say <img src="https://latex.codecogs.com/png.latex?B"> and <img src="https://latex.codecogs.com/png.latex?%5Coverline%7BB%7D">), then (with obvious generalization to more than two subsets):</p>
<p><img src="https://latex.codecogs.com/png.latex?%5CPr(A)%20=%20%5CPr(A%20%5Cmid%20B)%20%5CPr(B)%20+%20%5CPr(A%20%5Cmid%20%5Coverline%7BB%7D)%20%5CPr(%5Coverline%7BB%7D)"></p>
<p>But what if you’re dealing with <img src="https://latex.codecogs.com/png.latex?%5CPr(A%20%5Cmid%20C)"> instead of just <img src="https://latex.codecogs.com/png.latex?%5CPr(A)">? What’s the formula for the law of total probability in that case? What you’re searching for can be found by googling for “total law probability conditional”:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5CPr(A%20%5Cmid%20C)%20=%20%5CPr(A%20%5Cmid%20B,%20C)%20%5CPr(B%20%5Cmid%20C)%20+%20%5CPr(A%20%5Cmid%20%5Coverline%7BB%7D,%20C)%20%5CPr(%5Coverline%7BB%7D%20%5Cmid%20C)"></p>
<p>There’s a great derivation here: <a href="https://math.stackexchange.com/questions/2377816/applying-law-of-total-probability-to-conditional-probability">https://math.stackexchange.com/questions/2377816/applying-law-of-total-probability-to-conditional-probability</a>.</p>
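<p>The conditional formula can also be checked numerically. The sketch below builds an arbitrary joint distribution over three binary events and compares both sides of the identity; the array <code>joint</code> and the helper <code>pr()</code> are hypothetical names introduced for this illustration, with index 1 meaning the event occurs and index 2 meaning it does not:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">set.seed(1)  # arbitrary seed, for reproducibility only

# Random joint distribution over three binary events A, B and C
joint &lt;- array(
  runif(8),
  dim = c(2, 2, 2),
  dimnames = list(A = c('yes', 'no'), B = c('yes', 'no'), C = c('yes', 'no')))
joint &lt;- joint / sum(joint)

# Hypothetical helper: total probability of the selected cells;
# leaving an argument at its default marginalizes over that event
pr &lt;- function(A = 1:2, B = 1:2, C = 1:2) sum(joint[A, B, C])

# Left-hand side: Pr(A | C)
lhs &lt;- pr(A = 1, C = 1) / pr(C = 1)

# Right-hand side: Pr(A | B, C) Pr(B | C) + Pr(A | not B, C) Pr(not B | C)
rhs &lt;- pr(1, 1, 1) / pr(B = 1, C = 1) * pr(B = 1, C = 1) / pr(C = 1) +
  pr(1, 2, 1) / pr(B = 2, C = 1) * pr(B = 2, C = 1) / pr(C = 1)

all.equal(lhs, rhs)</code></pre></div>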



 ]]></description>
  <guid>https://blog.davidlindelof.com/posts/2023/03/the-law-of-total-probability-applied-to-a-conditional-probability/</guid>
  <pubDate>Wed, 08 Mar 2023 00:00:00 GMT</pubDate>
</item>
<item>
  <title>xkcd and Data Science</title>
  <link>https://blog.davidlindelof.com/posts/2023/02/data-science-the-xkcd-edition/</link>
  <description><![CDATA[ 






<p>I’ve been collecting all <a href="https://xkcd.com/">xkcd comics</a> related to Data Science and/or Statistics. Here they are, but if you think I’m missing any please let me know in the comments. Use at will in your data visualisations but remember to attribute. Sorted in reverse chronological order.</p>
<p><a href="https://xkcd.com/3007/"><img src="https://blog.davidlindelof.com/posts/2023/02/data-science-the-xkcd-edition/images/probabilistic_uncertainty_2x.png" class="img-fluid"></a></p>
<p><a href="https://xkcd.com/2918/"><img src="https://blog.davidlindelof.com/posts/2023/02/data-science-the-xkcd-edition/images/tick_marks_2x.png" class="img-fluid"></a></p>
<p><a href="https://xkcd.com/2884/"><img src="https://blog.davidlindelof.com/posts/2023/02/data-science-the-xkcd-edition/images/log_alignment_2x.png" class="img-fluid"></a></p>
<p><a href="https://xkcd.com/2899/"><img src="https://blog.davidlindelof.com/posts/2023/02/data-science-the-xkcd-edition/images/goodharts_law_2x.png" class="img-fluid"></a></p>
<p><a href="https://xkcd.com/2864/"><img src="https://blog.davidlindelof.com/posts/2023/02/data-science-the-xkcd-edition/images/compact_graphs_2x.png" class="img-fluid"></a></p>
<p><a href="https://xkcd.com/2755/"><img src="https://blog.davidlindelof.com/posts/2023/02/data-science-the-xkcd-edition/images/effect_size_2x.png" class="img-fluid"></a></p>
<figure class="figure">
<p><img src="https://blog.davidlindelof.com/posts/2023/02/data-science-the-xkcd-edition/images/k_means_clustering.png" class="img-fluid figure-img"></p>
<figcaption>
<p><a href="https://xkcd.com/2731/">K-Means Clustering</a></p>
</figcaption>
</figure>
<figure class="figure">
<p><img src="https://blog.davidlindelof.com/posts/2023/02/data-science-the-xkcd-edition/images/methodology_trial.png" class="img-fluid figure-img"></p>
<figcaption>
<p><a href="https://xkcd.com/2726/">Methodology Trial</a></p>
</figcaption>
</figure>
<figure class="figure">
<p><img src="https://blog.davidlindelof.com/posts/2023/02/data-science-the-xkcd-edition/images/euler_diagrams.png" class="img-fluid figure-img"></p>
<figcaption>
<p><a href="https://xkcd.com/2721/">Euler Diagrams</a></p>
</figcaption>
</figure>
<figure class="figure">
<p><img src="https://blog.davidlindelof.com/posts/2023/02/data-science-the-xkcd-edition/images/data_point.png" class="img-fluid figure-img"></p>
<figcaption>
<p><a href="https://xkcd.com/2713/">Data Point</a></p>
</figcaption>
</figure>
<figure class="figure">
<p><img src="https://blog.davidlindelof.com/posts/2023/02/data-science-the-xkcd-edition/images/change_in_slope.png" class="img-fluid figure-img"></p>
<figcaption>
<p><a href="https://xkcd.com/2701/">Change in Slope</a></p>
</figcaption>
</figure>
<figure class="figure">
<p><img src="https://blog.davidlindelof.com/posts/2023/02/data-science-the-xkcd-edition/images/proxy_variable.png" class="img-fluid figure-img"></p>
<figcaption>
<p><a href="https://xkcd.com/2652">Proxy Variable</a></p>
</figcaption>
</figure>
<figure class="figure">
<p><img src="https://blog.davidlindelof.com/posts/2023/02/data-science-the-xkcd-edition/images/health_data.png" class="img-fluid figure-img"></p>
<figcaption>
<p><a href="https://xkcd.com/2620">Health Data</a></p>
</figcaption>
</figure>
<figure class="figure">
<p><img src="https://blog.davidlindelof.com/posts/2023/02/data-science-the-xkcd-edition/images/garbage_math.png" class="img-fluid figure-img"></p>
<figcaption>
<p><a href="https://xkcd.com/2295/">Garbage Math</a></p>
</figcaption>
</figure>
<figure class="figure">
<p><img src="https://blog.davidlindelof.com/posts/2023/02/data-science-the-xkcd-edition/images/selection_bias.png" class="img-fluid figure-img"></p>
<figcaption>
<p><a href="https://xkcd.com/2618/">Selection Bias</a></p>
</figcaption>
</figure>
<figure class="figure">
<p><img src="https://blog.davidlindelof.com/posts/2023/02/data-science-the-xkcd-edition/images/spacecraft_debris_odds_ratio.png" class="img-fluid figure-img"></p>
<figcaption>
<p><a href="https://xkcd.com/2599/">Spacecraft Debris Odds Ratio</a></p>
</figcaption>
</figure>
<figure class="figure">
<p><img src="https://blog.davidlindelof.com/posts/2023/02/data-science-the-xkcd-edition/images/control_group.png" class="img-fluid figure-img"></p>
<figcaption>
<p><a href="https://xkcd.com/2576/">Control Group</a></p>
</figcaption>
</figure>
<figure class="figure">
<p><img src="https://blog.davidlindelof.com/posts/2023/02/data-science-the-xkcd-edition/images/confounding_variables.png" class="img-fluid figure-img"></p>
<figcaption>
<p><a href="https://xkcd.com/2560/">Confounding Variables</a></p>
</figcaption>
</figure>
<figure class="figure">
<p><img src="https://blog.davidlindelof.com/posts/2023/02/data-science-the-xkcd-edition/images/bayes_theorem.png" class="img-fluid figure-img"></p>
<figcaption>
<p><a href="https://xkcd.com/2545/">Bayes’ Theorem</a></p>
</figcaption>
</figure>
<figure class="figure">
<p><img src="https://blog.davidlindelof.com/posts/2023/02/data-science-the-xkcd-edition/images/slope_hypothesis_testing.png" class="img-fluid figure-img"></p>
<figcaption>
<p><a href="https://xkcd.com/2533/">Slope Hypothesis Testing</a></p>
</figcaption>
</figure>
<figure class="figure">
<p><img src="https://blog.davidlindelof.com/posts/2023/02/data-science-the-xkcd-edition/images/flawed_data.png" class="img-fluid figure-img"></p>
<figcaption>
<p><a href="https://xkcd.com/2494/">Flawed Data</a></p>
</figcaption>
</figure>
<figure class="figure">
<p><img src="https://blog.davidlindelof.com/posts/2023/02/data-science-the-xkcd-edition/images/error_types.png" class="img-fluid figure-img"></p>
<figcaption>
<p><a href="https://xkcd.com/2303/">Error Types</a></p>
</figcaption>
</figure>
<figure class="figure">
<p><img src="https://blog.davidlindelof.com/posts/2023/02/data-science-the-xkcd-edition/images/modified_bayes_theorem.png" class="img-fluid figure-img"></p>
<figcaption>
<p><a href="https://xkcd.com/2059/">Modified Bayes’ Theorem</a></p>
</figcaption>
</figure>
<figure class="figure">
<p><img src="https://blog.davidlindelof.com/posts/2023/02/data-science-the-xkcd-edition/images/curve_fitting.png" class="img-fluid figure-img"></p>
<figcaption>
<p><a href="https://xkcd.com/2048/">Curve-Fitting</a></p>
</figcaption>
</figure>
<figure class="figure">
<p><img src="https://blog.davidlindelof.com/posts/2023/02/data-science-the-xkcd-edition/images/machine_learning.png" class="img-fluid figure-img"></p>
<figcaption>
<p><a href="https://xkcd.com/1838/">Machine Learning</a></p>
</figcaption>
</figure>
<figure class="figure">
<p><img src="https://blog.davidlindelof.com/posts/2023/02/data-science-the-xkcd-edition/images/linear_regression.png" class="img-fluid figure-img"></p>
<figcaption>
<p><a href="https://xkcd.com/1725/">Linear Regression</a></p>
</figcaption>
</figure>
<figure class="figure">
<p><img src="https://blog.davidlindelof.com/posts/2023/02/data-science-the-xkcd-edition/images/p_values.png" class="img-fluid figure-img"></p>
<figcaption>
<p><a href="https://xkcd.com/1478/">P-Values</a></p>
</figcaption>
</figure>
<figure class="figure">
<p><img src="https://blog.davidlindelof.com/posts/2023/02/data-science-the-xkcd-edition/images/t_distribution.png" class="img-fluid figure-img"></p>
<figcaption>
<p><a href="https://xkcd.com/1347/">t Distribution</a></p>
</figcaption>
</figure>
<figure class="figure">
<p><img src="https://blog.davidlindelof.com/posts/2023/02/data-science-the-xkcd-edition/images/increased_risk.png" class="img-fluid figure-img"></p>
<figcaption>
<p><a href="https://xkcd.com/1252/">Increased Risk</a></p>
</figcaption>
</figure>
<figure class="figure">
<p><img src="https://blog.davidlindelof.com/posts/2023/02/data-science-the-xkcd-edition/images/seashell.png" class="img-fluid figure-img"></p>
<figcaption>
<p><a href="https://xkcd.com/1236/">Seashell</a></p>
</figcaption>
</figure>
<figure class="figure">
<p><img src="https://blog.davidlindelof.com/posts/2023/02/data-science-the-xkcd-edition/images/log_scale.png" class="img-fluid figure-img"></p>
<figcaption>
<p><a href="https://xkcd.com/1162/">Log Scale</a></p>
</figcaption>
</figure>
<figure class="figure">
<p><img src="https://blog.davidlindelof.com/posts/2023/02/data-science-the-xkcd-edition/images/cell_phones.png" class="img-fluid figure-img"></p>
<figcaption>
<p><a href="https://xkcd.com/925/">Cell Phones</a></p>
</figcaption>
</figure>
<figure class="figure">
<p><img src="https://blog.davidlindelof.com/posts/2023/02/data-science-the-xkcd-edition/images/significant.png" class="img-fluid figure-img"></p>
<figcaption>
<p><a href="https://xkcd.com/882/">Significant</a></p>
</figcaption>
</figure>
<figure class="figure">
<p><img src="https://blog.davidlindelof.com/posts/2023/02/data-science-the-xkcd-edition/images/conditional_risk.png" class="img-fluid figure-img"></p>
<figcaption>
<p><a href="https://xkcd.com/795/">Conditional Risk</a></p>
</figcaption>
</figure>
<figure class="figure">
<p><img src="https://blog.davidlindelof.com/posts/2023/02/data-science-the-xkcd-edition/images/correlation.png" class="img-fluid figure-img"></p>
<figcaption>
<p><a href="https://xkcd.com/552/">Correlation</a></p>
</figcaption>
</figure>
<figure class="figure">
<p><img src="https://blog.davidlindelof.com/posts/2023/02/data-science-the-xkcd-edition/images/boyfriend.png" class="img-fluid figure-img"></p>
<figcaption>
<p><a href="https://xkcd.com/539/">Boyfriend</a></p>
</figcaption>
</figure>



 ]]></description>
  <guid>https://blog.davidlindelof.com/posts/2023/02/data-science-the-xkcd-edition/</guid>
  <pubDate>Mon, 20 Feb 2023 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Quick note about bootstrapping</title>
  <link>https://blog.davidlindelof.com/posts/2023/02/quick-note-about-bootstrapping/</link>
  <description><![CDATA[ 






<p>Cross-validation—the act of keeping a subset of data to measure the performance of a model trained on the rest of the data—never sounded right to me.</p>
<p>It just doesn’t feel optimal to retain an arbitrary fraction of the data when you train your model. Oh and then you’re also supposed to keep <em>another</em> fraction for <em>validating</em> the model. So one set for training, one set for testing (to find the best model structure), and one set for validating the model, i.e.&nbsp;measuring its performance. That’s throwing away quite a lot of data that could be used for training.</p>
<p>That’s why I was excited to learn that bootstrapping provides an alternative. Bootstrapping is an elegant way to maximize the use of the available data, typically when you want to estimate confidence intervals or any other statistic.</p>
<p>In “<a href="https://www.amazon.com/Applied-Predictive-Modeling-Max-Kuhn/dp/1461468485/ref=sr_1_4?crid=OG31K5T3TW79&amp;keywords=Applied+Predictive+Modelling&amp;qid=1652540040&amp;sprefix=applied+predictive+modelling%2Caps%2C192&amp;sr=8-4">Applied Predictive Modeling</a>”, the authors discuss resampling techniques, which include bootstrapping and cross-validation (p.&nbsp;72). They explain that bootstrap validation consists of building <em>N</em> models with bootstrapped data and estimating their performance on the out-of-bag samples, i.e.&nbsp;the samples not used in building the model.</p>
<p>I think that may be an error. I don’t have <a href="https://www.amazon.com/Introduction-Bootstrap-Monographs-Statistics-Probability/dp/0412042312/ref=sr_1_1?crid=1MSUNDIAEZ2A1&amp;keywords=efron+bootstrap&amp;qid=1652540247&amp;sprefix=efron+bootstrap%2Caps%2C170&amp;sr=8-1">Efron’s seminal book on the bootstrap</a> anymore but I’m pretty sure the accuracy was evaluated against the entire data set, not just the out-of-bag samples.</p>
<p>In “<a href="https://www.amazon.com/Regression-Modeling-Strategies-Applications-Statistics/dp/331933039X/ref=sr_1_1?crid=3FSNZYETV1W07&amp;keywords=Regression+Modelling+Strategies&amp;qid=1652540658&amp;sprefix=regression+modeling+strategies%2Caps%2C408&amp;sr=8-1">Regression Modeling Strategies</a>”, Frank Harrell describes model validation with the bootstrap thus (emphasis mine):</p>
<blockquote class="blockquote">
<p>With the “simple bootstrap” [178, p.&nbsp;247], one repeatedly fits the model in a bootstrap sample and evaluates the performance of the model on the <strong>original sample</strong>. The estimate of the likely performance of the final model on future data is estimated by the average of all of the indexes computed on the original sample.</p>
<p>Frank Harrell, <em>Regression Modeling Strategies</em></p>
</blockquote>
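<p>To make the difference between the two evaluation schemes concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the toy data, the least-squares line fit, and the helper names (<code>fit_line</code>, <code>mse</code>, <code>bootstrap_validation</code>) are mine and do not come from either book. Each bootstrapped model is scored twice, once on the original sample (the “simple bootstrap” quoted above) and once on its out-of-bag samples.</p>

```python
import random
import statistics


def fit_line(points):
    """Ordinary least-squares fit of y = a + b * x; returns (a, b)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx if sxx else 0.0
    return mean_y - slope * mean_x, slope


def mse(model, points):
    """Mean squared error of a fitted line on a set of points."""
    a, b = model
    return statistics.fmean((y - (a + b * x)) ** 2 for x, y in points)


def bootstrap_validation(data, n_models=200, seed=0):
    """Return (simple-bootstrap MSE, out-of-bag MSE), averaged over resamples."""
    rng = random.Random(seed)
    on_original, on_oob = [], []
    for _ in range(n_models):
        idx = [rng.randrange(len(data)) for _ in data]  # sample with replacement
        model = fit_line([data[i] for i in idx])
        on_original.append(mse(model, data))  # evaluate on the original sample
        drawn = set(idx)
        oob = [p for i, p in enumerate(data) if i not in drawn]
        if oob:  # evaluate on the samples this model never saw
            on_oob.append(mse(model, oob))
    return statistics.fmean(on_original), statistics.fmean(on_oob)


# Hypothetical toy data: y = 2x + 1 plus Gaussian noise.
rng = random.Random(42)
data = [(i / 10, 2 * (i / 10) + 1 + rng.gauss(0, 0.5)) for i in range(50)]
simple, oob = bootstrap_validation(data)
print(f"simple bootstrap MSE: {simple:.3f}, out-of-bag MSE: {oob:.3f}")
```

<p>The only difference between the two estimates is the set of points passed to <code>mse</code>; everything else in the resampling loop is shared.</p>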



 ]]></description>
  <guid>https://blog.davidlindelof.com/posts/2023/02/quick-note-about-bootstrapping/</guid>
  <pubDate>Mon, 06 Feb 2023 00:00:00 GMT</pubDate>
</item>
<item>
  <title>The most under-rated programming books</title>
  <link>https://blog.davidlindelof.com/posts/2021/06/the-most-under-rated-programming-books/</link>
  <description><![CDATA[ 






<p>Ask any programmer what their favourite programming book is, and their answer will be one of the usual suspects: <em>Code Complete</em>, <em>The Pragmatic Programmer</em>, or <em>Design Patterns</em>. And rightly so; these are outstanding and highly regarded works that belong on every programmer’s bookshelf. (If you’re just starting to build up your bookshelf, <a href="https://blog.codinghorror.com/recommended-reading-for-developers/">Jeff Atwood has some great recommendations</a>.)</p>
<p>But once you get past the “essential” books you’ll find that there are many incredibly good programming books out there that people don’t talk much about, but which were essential in taking me to the next levels in my professional growth.</p>
<p>Here’s a partial list of such books. I’m sure there are many others; feel free to mention them in the comments.</p>
<section id="growing-object-oriented-software-guided-by-tests" class="level2">
<h2 class="anchored" data-anchor-id="growing-object-oriented-software-guided-by-tests"><a href="https://lesen.amazon.de/kp/embed?asin=B002TIOYVW&amp;preview=newtab&amp;linkCode=kpe&amp;ref_=cm_sw_r_kb_dp_GZV27G0F7TCFK4VE5JM6">Growing Object-Oriented Software, Guided by Tests</a></h2>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://blog.davidlindelof.com/posts/2021/06/the-most-under-rated-programming-books/images/B002TIOYVW.01.L.jpg" class="img-fluid figure-img"></p>
<figcaption>Cover of “Growing Object-Oriented Software, Guided by Tests”</figcaption>
</figure>
</div>
<!--more-->
<p>Imagine looking over the shoulders of a master programmer as she develops a real-world application, feature by feature, beginning each feature with an automated end-to-end integration test, and ending it with a round of refactoring. This is what you get reading this book, as the authors walk you through the development of an automated auction bidding system.</p>
<p>Unlike traditional books on test-driven development (TDD), this one begins by setting up a test harness around the <em>entire system</em>, simulating the chat-based API of the auction system. That makes it possible to write end-to-end tests for each use case, <em>before</em> switching to TDD to write the classes that implement the use case.</p>
<p>It’s the only book I know that covers a complete case study like this. It’s quite possibly the only programming book I’ve read twice, and it has had a profound impact on the way I now develop machine-learning systems. This book inspired me to begin a machine-learning project in R with a test harness mimicking the production database systems; <a href="https://davidlindelof.com/machine-learning-in-r-start-with-an-end-to-end-test/">I’ve given an overview of the project elsewhere</a>.</p>
<p>This book also helped me understand the importance of mocking class collaborators: classes whose behaviour needs to be stubbed in order to write unit tests. I used to frown on excessive use of collaborators; now I fully embrace them. But one of the book’s key takeaway messages is perhaps too easy to overlook in some languages such as Python: <em>never mock a class you don’t own</em>. Except for primitive, stable classes (such as <code>String</code>), it’s almost always better to write a thin wrapper for third-party libraries (heavily covered by its own integration tests), and then mock <em>the wrapper</em> (which you own) instead of the third-party library (which you don’t).</p>
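<p>As a rough illustration of that rule, here is a minimal Python sketch using the standard library’s <code>unittest.mock</code>. The <code>WeatherGateway</code> wrapper, its third-party <code>client</code>, and the <code>should_open_windows</code> function are hypothetical names invented for this example; the point is only that the unit test mocks the wrapper we own, while the third-party client would be exercised by the wrapper’s own integration tests.</p>

```python
from unittest.mock import Mock


class WeatherGateway:
    """Thin wrapper that we own around a third-party client (hypothetical)."""

    def __init__(self, client):
        self._client = client  # third-party object; covered by integration tests

    def temperature_celsius(self, city):
        # The only place that knows the shape of the third-party API.
        return self._client.get(city)["temp_c"]


def should_open_windows(gateway, city):
    """Application logic under test; it depends only on our wrapper."""
    return gateway.temperature_celsius(city) >= 20.0


# Unit test: mock the wrapper we own, never the third-party client itself.
gateway = Mock(spec=WeatherGateway)
gateway.temperature_celsius.return_value = 23.5
assert should_open_windows(gateway, "Geneva")
gateway.temperature_celsius.assert_called_once_with("Geneva")
```

<p>If the third-party API changes, only <code>WeatherGateway</code> and its integration tests need to follow; the unit tests of the application logic stay untouched.</p>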
</section>
<section id="your-code-as-a-crime-scene" class="level2">
<h2 class="anchored" data-anchor-id="your-code-as-a-crime-scene"><a href="https://lesen.amazon.de/kp/embed?asin=B00ZB5XWBI&amp;preview=newtab&amp;linkCode=kpe&amp;ref_=cm_sw_r_kb_dp_TFMG2EGZE39B63Q4GPEX">Your Code as a Crime Scene</a></h2>
<p><img src="https://blog.davidlindelof.com/posts/2021/06/the-most-under-rated-programming-books/images/B00ZB5XWBI.01.L.jpg" class="img-fluid"></p>
<p>Version-control systems such as Git are essential for coordinating work between team members and for occasionally rolling back a system that fails in production. But they also provide priceless insights into potential issues with your development process.</p>
<p>Adam Tornhill wrote this amazing introduction to <em>Software Forensics</em>: mining the history of your version-control system to detect signs of bad design (such as overly complex classes or excessive coupling), and forecasting where bugs are more likely to lurk.</p>
<p>Every chapter is immediately actionable, whether by a software engineering manager or by the team itself. During my startup days I was able to run this on our codebase, correlating change frequency with module size, and came up with a <a href="https://davidlindelof.com/predicting-where-the-bugs-are/">heatmap I’ve documented elsewhere</a>, showing where to focus testing efforts. Sure enough, the largest, most frequently changed modules were the most prone to defects.</p>
<p>It’s a novel way to exploit the information in your version-control system which, as far as I know, has never been proposed elsewhere (and certainly not in software engineering classes). As an additional benefit, it may also put more pressure on the team to be disciplined in keeping the version-control system log clean and tidy.</p>
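<p>The kind of analysis described above can be sketched in a few lines of Python. This is a deliberately crude version and not Adam Tornhill’s tooling: it assumes the log text was produced by something like <code>git log --name-only --pretty=format:</code> (one path per line, blank lines between commits), it uses made-up file names and sizes, and it ranks files by the product of revision count and lines of code, which is my own simplification.</p>

```python
from collections import Counter

# Hypothetical excerpt of `git log --name-only --pretty=format:` output:
# one file path per line, blank lines between commits.
SAMPLE_LOG = """\
src/billing.py
src/utils.py

src/billing.py

src/billing.py
src/report.py
"""

# Hypothetical module sizes in lines of code (e.g. counted with `wc -l`).
SAMPLE_SIZES = {"src/billing.py": 1200, "src/utils.py": 90, "src/report.py": 400}


def change_frequency(log_text):
    """Count how many commits touched each file."""
    return Counter(line.strip() for line in log_text.splitlines() if line.strip())


def hotspots(log_text, sizes):
    """Rank files by revisions * size, a crude proxy for defect risk."""
    revisions = change_frequency(log_text)
    scored = [(path, revisions[path], sizes.get(path, 0)) for path in revisions]
    return sorted(scored, key=lambda row: row[1] * row[2], reverse=True)


for path, n_revisions, loc in hotspots(SAMPLE_LOG, SAMPLE_SIZES):
    print(f"{path}: {n_revisions} revisions, {loc} lines")
```

<p>On a real repository the two inputs would come from the version-control log and a line count of the working tree; the ranking logic stays the same.</p>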
</section>
<section id="applying-uml-and-patterns" class="level2">
<h2 class="anchored" data-anchor-id="applying-uml-and-patterns"><a href="https://www.amazon.com/dp/0131489062/ref=cm_sw_em_r_mt_dp_HKGR39RMX1RNMV3C6M6T">Applying UML and Patterns</a></h2>
<p><a href="https://www.amazon.com/dp/0131489062"><img src="https://blog.davidlindelof.com/posts/2021/06/the-most-under-rated-programming-books/images/51gVLEtrCNL._SX398_BO1,204,203,200_.jpg" class="img-fluid"></a></p>
<p>When used correctly, UML can be a great tool for communicating design decisions in an unambiguous manner. Even as a data scientist, I frequently use UML in my notes to understand the relationships between the entities represented in the datasets. Sadly, misguided attempts to generate code from UML and other myths have given UML an undeserved reputation for being a failed attempt at design formalism, and UML doesn’t seem to be widely used (or understood) these days.</p>
<p>This book is another “peek over the shoulders of a giant” book where we follow the evolution of two non-trivial applications: a board game and a cash register. Design decisions are expressed with UML throughout the book and updated as the developer learns more about the problem space. I still vividly remember the <em>Aha!</em> moment when the author realised that the playing piece and the player were not separate concepts, and did not need their own classes; they were effectively the same class.</p>
</section>
<section id="the-little-schemer-series" class="level2">
<h2 class="anchored" data-anchor-id="the-little-schemer-series"><a href="https://www.amazon.com/dp/0262560992/ref=cm_sw_em_r_mt_dp_2ZRT5Q0DPDT2KB8NGAP2">The Little Schemer</a> series</h2>
<p><a href="https://www.amazon.com/dp/0262560992"><img src="https://blog.davidlindelof.com/posts/2021/06/the-most-under-rated-programming-books/images/41d3h12T4PL._SX403_BO1,204,203,200_.jpg" class="img-fluid"></a></p>
<p><em>Is it true that this is an atom?</em><br>
<code>atom</code></p>
<p>Yes, because <code>atom</code> is a string of characters beginning with the letter <code>a</code>.</p>
<p>Hard to forget the opening question of <em>The Little Schemer</em>, a mind-blowing exposition of the Scheme programming language that begins with atoms (as above) and ends with the <a href="https://en.wikipedia.org/wiki/Fixed-point_combinator#Y_combinator">Y-combinator</a> and a Scheme parser. It’s been said before and I’ll say it again: if you intend to be a professional programmer you need to learn a LISP dialect such as Scheme. I’m not saying you need to be proficient in it, or even to write your own program in it; but you need to understand its paradigms, and see how far it’s possible to push the code-as-data concept that’s been slowly but surely re-discovered in modern programming languages.</p>
</section>
<section id="refactoring-databases" class="level2">
<h2 class="anchored" data-anchor-id="refactoring-databases"><a href="https://www.amazon.com/dp/B001QAP36E/ref=cm_sw_em_r_mt_dp_CNR605ZVVD62TC5N8CHS">Refactoring Databases</a></h2>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://blog.davidlindelof.com/posts/2021/06/the-most-under-rated-programming-books/images/51mMhFDXhHL.jpg" class="img-fluid figure-img"></p>
<figcaption>Cover of “Refactoring Databases: Evolutionary Database Design”</figcaption>
</figure>
</div>
<p>Databases are frequently used by more than one application in a given organization. Therefore, changing the structure of the schema as the need arises becomes a scary proposition, because one small change to the schema can affect an unknown number of dependents. But it doesn’t need to be that way. Indeed, refactoring a database is just as desirable as refactoring traditional computer code, or even more so. This book shows you how to do it in a safe way, how to communicate the changes to your stakeholders, and how to give them enough time to adapt to the changes.</p>
<p><em>Refactoring Databases</em> contains a number of techniques to improve the structure of your database as you better understand the business in which it operates. It teaches you techniques based on views and triggers that let you gradually roll out changes over time, announce them to stakeholders, give them a deadline by which they must adapt to the new schema, and deprecate the old one.</p>
<p>In a prior gig I applied the techniques in this book and was able to maintain <em>three</em> different schemas concurrently in the same database. We never felt the need to “finish” any schema migration because the system of views and triggers made it possible to support the three schemas indefinitely.</p>
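<p>To give a flavour of the views-and-triggers approach, here is a minimal sketch using Python’s built-in <code>sqlite3</code> module. The table, view, and column names are hypothetical, and SQLite merely stands in for whatever database you run in production; the book’s own refactorings are more thorough. A renamed column lives in a new table, while legacy clients keep reading and writing the old shape through a view backed by an <code>INSTEAD OF</code> trigger.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- New schema: the column formerly called fname is now first_name.
    CREATE TABLE customer_v2 (id INTEGER PRIMARY KEY, first_name TEXT);

    -- Legacy applications keep reading the old shape through a view...
    CREATE VIEW customer AS
        SELECT id, first_name AS fname FROM customer_v2;

    -- ...and keep writing through it thanks to an INSTEAD OF trigger.
    CREATE TRIGGER customer_insert INSTEAD OF INSERT ON customer
    BEGIN
        INSERT INTO customer_v2 (id, first_name) VALUES (NEW.id, NEW.fname);
    END;
    """
)

# A legacy client, unaware of the migration, uses the old column name.
conn.execute("INSERT INTO customer (id, fname) VALUES (1, 'Ada')")
# A migrated client uses the new table directly.
conn.execute("INSERT INTO customer_v2 (id, first_name) VALUES (2, 'Grace')")

legacy_rows = conn.execute("SELECT id, fname FROM customer ORDER BY id").fetchall()
new_rows = conn.execute("SELECT id, first_name FROM customer_v2 ORDER BY id").fetchall()
print(legacy_rows)
print(new_rows)
```

<p>Both clients see the same two rows, each under the column name it expects, so neither has to be migrated on the other’s schedule.</p>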
<p>So who should read this book? Database administrators would be an obvious answer; after all, they are the ones who are going to apply the techniques in this book. But more broadly than that, the whole development team needs to be aware of these techniques because they make possible what was thought to be impossible, that is, rolling out changes to the database without breaking any dependent application.</p>
<p>These are techniques that software developers, who work on an application talking to a database, must be aware of. Furthermore, if your organisation is agile enough to change its database schema, you need to be aware of this possibility. Therefore you need to structure your application so it becomes immune to those changes, and this book will show you how.</p>
</section>
<section id="why-programs-fail" class="level2">
<h2 class="anchored" data-anchor-id="why-programs-fail"><a href="https://www.amazon.com/dp/B0092L8LCW/ref=cm_sw_em_r_mt_dp_6SYV5NRSK8Z0CGD5FX6Q">Why Programs Fail</a></h2>
<p><img src="https://blog.davidlindelof.com/posts/2021/06/the-most-under-rated-programming-books/images/51sQtQzMs2L._SX402_BO1,204,203,200_.jpg" class="img-fluid"></p>
<p>Between 40% and 80% of software costs are spent on maintenance, adding new features, or fixing bugs (<a href="https://www.amazon.com/dp/B001TKD4RG/ref=cm_sw_em_r_mt_dp_4Y79VTTW33F60NZPGD63">Facts and Fallacies of Software Engineering</a>). I’m not sure how much time is spent fixing bugs alone but it’s clearly a significant part of the software lifecycle costs. Yet all our curricula and programming books are mainly focused on the initial software development part.</p>
<p><em>Why Programs Fail</em> is the only book I know that is entirely devoted to the subject of debugging. Instead of the traditional method of stepping through the program with a debugger, mindlessly observing the program until something doesn’t seem right, Andreas Zeller proposes a far more active use of the debugger, informed by hypothesis testing and the construction of mental models of how the program should behave.</p>
<p>This book taught me a method to find the root cause, or fault, of any software error by successively refuting a sequence of hypotheses on the cause of the error. My engineering notebooks are full of entries that follow the same pattern:</p>
<ol type="1">
<li>Form a hypothesis on what the defect might be (<em>can it be that the average of this array of floats is smaller than all the array elements?</em>)</li>
<li>Write a prediction based on the hypothesis (<em>with this set of inputs, the variable <code>avg</code> will be smaller than the elements, triggering the assertion failure on next line</em>)</li>
<li>Run an experiment that will refute the prediction if it’s wrong; typically, this means running the code in the debugger, setting local variables or function arguments to the desired values</li>
<li>Observe the output (<em>wow, indeed <code>avg</code> is smaller than all the elements</em>)</li>
<li>Confirm or refute the original hypothesis. Refine the hypothesis (<em>maybe this is caused by floating-point rounding errors?</em>), and return to step 1 until the defect has been isolated. <a href="https://davidlindelof.com/where-all-floating-point-values-are-above-average/">Maybe write a blog post about it</a>.</li>
</ol>
<p>There’s a lot more in the book, and Andreas Zeller also has <a href="https://www.udacity.com/course/software-debugging--cs259">a highly recommended free course on Udacity covering the same topics</a>, such as delta-debugging (reducing failure-causing input to the smallest possible failing case) and fuzz testing (randomly evolving the program input to find failures). Both techniques are used today in advanced error-finding tools such as <a href="https://hypothesis.readthedocs.io/en/latest/">Hypothesis</a> and <a href="https://github.com/google/AFL">American Fuzzy Lop</a>. Both have been part of my standard toolkit for years.</p>
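<p>For readers who have never seen delta-debugging, here is a simplified Python sketch of the idea. It is not Zeller’s exact <em>ddmin</em> algorithm: this variant only tries removing chunks of the input and halves the chunk size when nothing more can be removed. The function name and the toy failure condition are assumptions made for illustration.</p>

```python
def ddmin(failing_input, fails):
    """Shrink a failing input (a list) while fails(candidate) stays true.

    Simplified delta debugging: try removing chunks of the input, and halve
    the chunk size whenever a full pass removes nothing.
    """
    assert fails(failing_input), "start from an input that actually fails"
    current = list(failing_input)
    chunk = max(len(current) // 2, 1)
    while True:
        shrunk = False
        start = 0
        while len(current) > start:
            candidate = current[:start] + current[start + chunk:]
            if fails(candidate):
                current = candidate  # keep the smaller failing input
                shrunk = True
            else:
                start += chunk  # this chunk is needed; move on
        if not shrunk:
            if chunk == 1:
                return current  # no single element can be removed any more
            chunk = max(chunk // 2, 1)


# Toy failure (hypothetical): the program crashes when both 3 and 7 are present.
def crashes(values):
    return 3 in values and 7 in values


minimal = ddmin(list(range(10)), crashes)
print(minimal)
```

<p>Starting from ten elements, the search ends with only the two that are jointly responsible for the failure, which is usually a much better starting point for the hypothesis loop described above.</p>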
</section>
<section id="honorable-mentions" class="level2">
<h2 class="anchored" data-anchor-id="honorable-mentions">Honorable mentions</h2>
<p>The following books are not strictly speaking programming books, yet I believe they belong on the shelf of any serious programmer who cares about their craft or their brain.</p>
<section id="pragmatic-thinking-and-learning" class="level3">
<h3 class="anchored" data-anchor-id="pragmatic-thinking-and-learning"><a href="https://www.amazon.com/dp/1934356050/ref=cm_sw_em_r_mt_dp_K4Y19YEF0SWR98SPN986">Pragmatic Thinking and Learning</a></h3>
<p><img src="https://blog.davidlindelof.com/posts/2021/06/the-most-under-rated-programming-books/images/51vzTtzCFmL._SX415_BO1,204,203,200_.jpg" class="img-fluid"></p>
<p>Andy Hunt, co-author of <em>The Pragmatic Programmer</em> and co-founder of <em>The Pragmatic Bookshelf</em>, has collected in a single accessible volume all that you need to know about how your brain works and how to use it better. He does an excellent job at explaining the two main modes by which the brain operates, the <em>Rich</em> mode and the <em>Linear</em> mode, and the importance of regularly switching from one to the other.</p>
<p>The book also includes great tips on how to learn efficiently, how to manage your focus, and how to think of your own journey towards expertise.</p>
<p>Over the years I’ve come across many other resources that discuss how the brain works and how to make it work better, but there was almost never anything new that hadn’t been described by Andy in this book.</p>
</section>
</section>
<section id="hackers-delight" class="level2">
<h2 class="anchored" data-anchor-id="hackers-delight"><a href="https://www.amazon.com/dp/0321842685/ref=cm_sw_em_r_mt_dp_G2AQSVZP5NQK4FP2HVWR">Hacker’s Delight</a></h2>
<p><img src="https://blog.davidlindelof.com/posts/2021/06/the-most-under-rated-programming-books/images/412u4+9U3LL._SX339_BO1,204,203,200_.jpg" class="img-fluid"></p>
<p>On the face of it, <em>Hacker’s Delight</em> will only be of interest to compiler writers. Who else needs to know that</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode cpp code-with-copy"><code class="sourceCode cpp"><span id="cb1-1"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> a<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span></span>
<span id="cb1-2">  x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> b<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="cb1-3"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span></span>
<span id="cb1-4">  x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> a<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span></code></pre></div></div>
<p>can be replaced, provided <code>x</code> is always either <code>a</code> or <code>b</code>, with the more efficient</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode cpp code-with-copy"><code class="sourceCode cpp"><span id="cb2-1">x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> a <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">^</span> b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">^</span> x<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span></code></pre></div></div>
<p>I freely admit that there’s nothing in this book that I could ever have found useful in my day-to-day work, yet I love this book. As the title implies, reading this book is a <em>delight</em> for the curious mind who wants to dive deeper into how machines work. It truly is a delightful, beautifully typeset book that belongs right up there next to your <em>Art of Computer Programming</em> series.</p>
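<p>If you want to convince yourself that the XOR trick works, here is a small Python check. It says nothing about speed (the trick is about machine instructions, not Python), and the function names are mine. It also shows that the two forms agree only when <code>x</code> holds one of the two values being toggled.</p>

```python
def toggle_branch(x, a, b):
    """The if/else form from the text."""
    return b if x == a else a


def toggle_xor(x, a, b):
    """The branch-free form: XOR-ing with (a ^ b) swaps a and b."""
    return a ^ b ^ x


a, b = 5, 12
for x in (a, b):
    assert toggle_branch(x, a, b) == toggle_xor(x, a, b)

# Outside {a, b} the two forms disagree, so the rewrite assumes x is a or b.
print(toggle_branch(7, a, b), toggle_xor(7, a, b))
```

<p>The trick works because <code>a ^ b ^ a</code> is <code>b</code> and <code>a ^ b ^ b</code> is <code>a</code>: XOR-ing a value with itself cancels it out.</p>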
</section>
<section id="concrete-mathematics" class="level2">
<h2 class="anchored" data-anchor-id="concrete-mathematics"><a href="https://www.amazon.com/dp/0201558025/ref=cm_sw_em_r_mt_dp_8FJH12JKM95GXQKB440P">Concrete Mathematics</a></h2>
<p><img src="https://blog.davidlindelof.com/posts/2021/06/the-most-under-rated-programming-books/images/61oqP3wQsiL._SX350_BO1,204,203,200_.jpg" class="img-fluid"></p>
<p>When it comes to day-to-day programming, <em>Concrete Mathematics</em> is probably the least useful book on this list, but boy was it a joy to read. Clearly intended for theoretical computer scientists, this book is far more accessible than its older cousins forming the <em>Art of Computer Programming</em> series. It covers everything you need to know about the analysis of algorithms and related topics.</p>
<p>But what’s really fascinating about this book is the way it shows how a mathematician <em>thinks</em>. Take for example the analysis of the Tower of Hanoi algorithm that introduces the book. It starts with the first few examples, which are enough to form a guess of what the most general formula is; and this guess later informs the search for mathematical proofs. This is <em>not</em> how the analysis of algorithms is generally taught, and I’m really grateful to the authors for showing me that they are human too.</p>
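<p>The same guess-then-prove workflow is easy to mimic in Python. The sketch below is my own illustration, not code from the book: it tabulates the minimum number of moves from the recurrence T(0) = 0, T(n) = 2&nbsp;T(n−1) + 1, and then checks the closed form that the first few values suggest.</p>

```python
def hanoi_moves(n):
    """Minimum moves for n discs, from T(0) = 0 and T(n) = 2 * T(n - 1) + 1."""
    return 0 if n == 0 else 2 * hanoi_moves(n - 1) + 1


# Step 1: tabulate small cases and look for a pattern.
small_cases = [hanoi_moves(n) for n in range(7)]
print(small_cases)  # [0, 1, 3, 7, 15, 31, 63]

# Step 2: the values suggest the guess T(n) = 2**n - 1; check it on more cases.
# This is evidence, not a proof; the proof itself goes by induction on n.
assert all(hanoi_moves(n) == 2**n - 1 for n in range(20))
```

<p>The numerical check only builds confidence in the guess; what it buys you is knowing exactly which statement the induction has to establish.</p>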
<p>Oh, and did I mention the marginal notes contributed by the authors’ students? These are worth the price of the book alone, such as this gem:</p>
<p><em>The summation symbol looks like a distorted pacman.</em></p>


</section>

 ]]></description>
  <guid>https://blog.davidlindelof.com/posts/2021/06/the-most-under-rated-programming-books/</guid>
  <pubDate>Wed, 16 Jun 2021 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Feature standardization considered harmful</title>
  <link>https://blog.davidlindelof.com/posts/2021/06/feature-standardization-considered-harmful/</link>
  <description><![CDATA[ 






<p>Many statistical learning algorithms perform better when the covariates are on similar scales. For example, it is common practice to standardize the features used by an artificial neural network so that the gradient of its objective function doesn’t depend on the physical units in which the features are described.</p>
<p>The same advice is frequently given for K-means clustering (see <a href="https://datascience.stackexchange.com/q/22795/69539">Do Clustering algorithms need feature scaling in the pre-processing stage?</a>, <a href="https://stats.stackexchange.com/q/21222/4370">Are mean normalization and feature scaling needed for k-means clustering?</a>, and <a href="https://stats.stackexchange.com/q/372521/4370">In cluster analysis should I scale (standardize) my data if variables are in the same units?</a>), but there’s a great counter-example given in <a href="https://g.co/kgs/wvHyxB">The Elements of Statistical Learning</a> that I try to reproduce here.</p>
<p><img src="https://blog.davidlindelof.com/posts/2021/06/feature-standardization-considered-harmful/images/img_7703.jpg" class="img-fluid"></p>
<p>Consider two point clouds (<img src="https://latex.codecogs.com/png.latex?n=50">&nbsp;each), randomly drawn around two centres, each 3 units away from the origin:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">set.seed</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">495</span>)</span>
<span id="cb1-2">n <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span></span>
<span id="cb1-3">d <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span></span>
<span id="cb1-4">x <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">matrix</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rnorm</span>(n <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sd =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">ncol =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)</span>
<span id="cb1-5">x[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>(n<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>), <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>] <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> x[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>(n<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>), <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>] <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> d</span>
<span id="cb1-6">x[(n<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>n, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>] <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> x[(n<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>n, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>] <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> d</span></code></pre></div></div>
<p><img src="https://blog.davidlindelof.com/posts/2021/06/feature-standardization-considered-harmful/images/image-1024x731.png" class="img-fluid"></p>
<p>The K-means algorithm has no problem clustering these points:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb2-1">km <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">kmeans</span>(x, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">centers =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)</span>
<span id="cb2-2">km<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>centers</span></code></pre></div></div>
<pre><code>##        [,1]         [,2]
## 1  2.922143  0.098422541
## 2 -2.991026 -0.003131757</code></pre>
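<p>Because the data are simulated, we know which cloud each point came from, so the claim can be checked directly. The following is only a minimal sketch: the <code>truth</code> vector and the <code>nstart = 10</code> argument are additions for illustration and are not part of the original analysis.</p>
<pre class="sourceCode r"><code class="sourceCode r"># Rebuild the two clouds exactly as above
set.seed(495)
n &lt;- 100
d &lt;- 3
x &lt;- matrix(rnorm(n * 2, sd = 1), ncol = 2)
x[1:(n/2), 1] &lt;- x[1:(n/2), 1] - d
x[(n/2 + 1):n, 1] &lt;- x[(n/2 + 1):n, 1] + d

# Hypothetical ground-truth labels: first half shifted left, second half shifted right
truth &lt;- rep(c("left", "right"), each = n / 2)

# Several random starts make the result less sensitive to initialization
km &lt;- kmeans(x, centers = 2, nstart = 10)

# Cross-tabulate true membership against the K-means assignment
table(truth, cluster = km$cluster)</code></pre>
<p>If K-means has recovered the clouds, each row of the table should have nearly all of its counts in a single column (the cluster numbering itself is arbitrary).</p>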
<p><img src="https://blog.davidlindelof.com/posts/2021/06/feature-standardization-considered-harmful/images/image-1-1024x731.png" class="img-fluid"></p>
<p>Let’s now see what happens when we standardize each feature. Since their means are already close to zero, we merely divide each feature by its standard deviation:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb4-1">x_scaled <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> x</span>
<span id="cb4-2">x_scaled[, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>] <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> x_scaled[, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>] <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sd</span>(x_scaled[, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])</span>
<span id="cb4-3">x_scaled[, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>] <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> x_scaled[, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>] <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sd</span>(x_scaled[, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>])</span></code></pre></div></div>
<p>And we run the K-means algorithm again on the scaled data:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb5-1">km_scaled <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">kmeans</span>(x_scaled, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">centers =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)</span></code></pre></div></div>
<p><img src="https://blog.davidlindelof.com/posts/2021/06/feature-standardization-considered-harmful/images/image-2-1024x731.png" class="img-fluid"></p>
<p>We see that K-means has completely failed to identify the clusters, because ‘standardizing’ the features has destroyed the clear separation between the clusters.</p>
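<p>One way to see what standardization did is to look at the numbers. The first feature is an equal mixture of two unit-variance clouds centered at -3 and +3, so its overall standard deviation is about sqrt(1 + 3^2) = sqrt(10), roughly 3.16, while the second feature keeps a standard deviation of about 1. Dividing by these values shrinks the gap between the cloud centers to the same order as the spread along the second feature. The sketch below is illustrative only; <code>sds</code>, <code>gap_before</code> and <code>gap_after</code> are names introduced here.</p>
<pre class="sourceCode r"><code class="sourceCode r"># Rebuild the two clouds exactly as above
set.seed(495)
n &lt;- 100
d &lt;- 3
x &lt;- matrix(rnorm(n * 2, sd = 1), ncol = 2)
x[1:(n/2), 1] &lt;- x[1:(n/2), 1] - d
x[(n/2 + 1):n, 1] &lt;- x[(n/2 + 1):n, 1] + d

# Per-feature standard deviations used by the standardization
sds &lt;- apply(x, 2, sd)

# Distance between the two cloud centers along the first feature,
# before and after dividing that feature by its standard deviation
gap_before &lt;- abs(mean(x[1:(n/2), 1]) - mean(x[(n/2 + 1):n, 1]))
gap_after &lt;- gap_before / sds[1]

c(sd_x1 = sds[1], sd_x2 = sds[2], gap_before = gap_before, gap_after = gap_after)</code></pre>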
<p>So what’s the lesson here? Clearly, for K-means you should not blindly standardize the features unless there are clear reasons to do so. In this toy example, we didn’t know what the features represent, so it’s impossible to say whether standardizing the features was the right thing to do. Perhaps the clusters seen pre-standardization were mere artefacts of our choice of units! As a rule of thumb, I suggest that features that are expressed in the same units and that represent the same ‘stuff’ (such as width and length) should not be standardized. If you have deeper insights into this I’d love to hear your comments.</p>



 ]]></description>
  <category>r</category>
  <guid>https://blog.davidlindelof.com/posts/2021/06/feature-standardization-considered-harmful/</guid>
  <pubDate>Fri, 11 Jun 2021 00:00:00 GMT</pubDate>
</item>
<item>
  <title>No, you have not controlled for confounders</title>
  <link>https://blog.davidlindelof.com/posts/2021/02/no-you-have-not-controlled-for-confounders/</link>
  <description><![CDATA[ 






<p>When observational data includes a treatment indicator and some possible confounders, it is very tempting to simply regress the outcome on all features (confounders and treatment alike), extract the coefficients associated with the treatment indicator, and proudly proclaim that “we have controlled for confounders and estimated the treatment effect”.</p>
<p>This approach is wrong. Very wrong. At least as wrong as that DIY electrical job you did last week: it looks all good and neat but you’ve made a critical mistake and there’s no way you can find out without killing yourself.</p>
<figure class="figure">
<p><img src="https://blog.davidlindelof.com/posts/2021/02/no-you-have-not-controlled-for-confounders/images/image.png" class="img-fluid figure-img"></p>
<figcaption>
<p><em>Or worse, by thinking you’ve controlled for confounders when you haven’t</em></p>
</figcaption>
</figure>
<p>I can’t explain&nbsp;<em>why</em>&nbsp;this is wrong (I’m not sure I understand it myself) but I can&nbsp;<em>show</em>&nbsp;you some examples proving that this approach is wrong. We’ll work through a few examples, where we compare the results of a traditional regression with those of a couple of legit causal inference libraries. Since we use simulated data, we’ll also be able to compare with the “true” treatment effect.</p>
<p>In all the following examples,&nbsp;<img src="https://latex.codecogs.com/png.latex?X">&nbsp;will be a&nbsp;<img src="https://latex.codecogs.com/png.latex?N%20%5Ctimes%2010">&nbsp;matrix of random covariates;&nbsp;<img src="https://latex.codecogs.com/png.latex?W">&nbsp;will be a random binary treatment indicator (which may or may not depend on&nbsp;<img src="https://latex.codecogs.com/png.latex?X">);&nbsp;<img src="https://latex.codecogs.com/png.latex?T">&nbsp;will be the treatment effect;&nbsp;<img src="https://latex.codecogs.com/png.latex?E">&nbsp;will be the main effects; and&nbsp;<img src="https://latex.codecogs.com/png.latex?Y">&nbsp;will be the outcome variable (<img src="https://latex.codecogs.com/png.latex?Y%5E%7B(0)%7D"> if untreated, <img src="https://latex.codecogs.com/png.latex?Y%5E%7B(1)%7D"> if treated), so that&nbsp;<img src="https://latex.codecogs.com/png.latex?Y%20%5Csim%20%5Cmathcal%7BN%7D(T%20W%20+%20E,%201)">. Our task is to estimate&nbsp;<img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D(Y%5E%7B(1)%7D%20-%20Y%5E%7B(0)%7D)">, the conditional average treatment effect (CATE), and&nbsp;<img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D(Y%5E%7B(1)%7D%20-%20Y%5E%7B(0)%7D%20%5Cmid%20W%20=%201)">, the conditional average treatment effect on the treated (CATT).</p>
<p>For each example we’ll estimate the treatment effects using:</p>
<ul>
<li>the causal forests from the&nbsp;<a href="https://grf-labs.github.io/grf/">grf package</a></li>
<li>the double machine learning approach from the&nbsp;<a href="https://github.com/MCKnaus/dmlmt">dmlmt package</a></li>
<li>a random forest, using the&nbsp;<a href="https://github.com/imbs-hl/ranger">ranger package</a>, trained on the entire dataset</li>
<li>two random forests (again from ranger) trained separately on the treated units and on the untreated units</li>
<li>a linear regression</li>
</ul>
<p>We’ll begin with the examples given in the&nbsp;<code>grf</code>&nbsp;package’s documentation, but first we load some required packages and set the size of the problem:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(grf)</span>
<span id="cb1-2">devtools<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">install_github</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">repo =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"MCKnaus/dmlmt"</span>)</span>
<span id="cb1-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(dmlmt)</span>
<span id="cb1-4"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(ranger)</span>
<span id="cb1-5"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(purrr)  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># for rbernoulli</span></span>
<span id="cb1-6">N <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20000</span>  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># number of observations</span></span>
<span id="cb1-7">P <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># number of covariates</span></span>
<span id="cb1-8"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">set.seed</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1984</span>)</span></code></pre></div></div>
<section id="case-1-example-from-causal_forests-help-page" class="level3">
<h3 class="anchored" data-anchor-id="case-1-example-from-causal_forests-help-page">Case 1: example from&nbsp;<code>causal_forest</code>’s help page</h3>
<p>This first example is the one given in&nbsp;<code>causal_forest</code>’s help page, where the treatment assignment&nbsp;<img src="https://latex.codecogs.com/png.latex?W">&nbsp;is completely randomized and the outcome only depends on&nbsp;<img src="https://latex.codecogs.com/png.latex?X_1">,&nbsp;<img src="https://latex.codecogs.com/png.latex?X_2">, and&nbsp;<img src="https://latex.codecogs.com/png.latex?X_3">:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb2-1">X <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">matrix</span>(</span>
<span id="cb2-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rnorm</span>(N <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> P),</span>
<span id="cb2-3">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">nrow =</span> N,</span>
<span id="cb2-4">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">ncol =</span> P,</span>
<span id="cb2-5">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">dimnames =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">list</span>(<span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NULL</span>, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">paste0</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'X'</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>P))</span>
<span id="cb2-6">)</span>
<span id="cb2-7">W <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rbernoulli</span>(N, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">p =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>)</span>
<span id="cb2-8">T <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">pmax</span>(X[, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>], <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span>
<span id="cb2-9">E <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> X[, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>] <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">pmin</span>(X[, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>], <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span>
<span id="cb2-10">Y <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> T <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> W <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> E <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rnorm</span>(N)</span></code></pre></div></div>
<p>Because the treatment is assigned independently of the covariates, the theoretical CATE is identical to the theoretical CATT; both are given by the expected positive part of a standard normal variable, which is just&nbsp;<img src="https://latex.codecogs.com/png.latex?1%20/%20%5Csqrt%7B2%5Cpi%7D">:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb3-1"><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> (<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sqrt</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> pi))</span></code></pre></div></div>
<pre><code>## [1] 0.3989423</code></pre>
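<p>For readers who want to verify that closed form: the expected positive part of a standard normal variable Z is the integral of z times the normal density over the positive half-line, and since the antiderivative of z * dnorm(z) is -dnorm(z), the integral equals dnorm(0) = 1 / sqrt(2 * pi). A quick numerical check, as a sketch using base R’s <code>integrate</code> (the name <code>positive_part</code> is introduced here for illustration):</p>
<pre class="sourceCode r"><code class="sourceCode r"># E[max(Z, 0)] for Z ~ N(0, 1): integrate z * dnorm(z) over [0, Inf)
positive_part &lt;- integrate(function(z) z * dnorm(z), lower = 0, upper = Inf)
positive_part$value

# Compare with the closed form 1 / sqrt(2 * pi), which is also dnorm(0)
all.equal(positive_part$value, 1 / sqrt(2 * pi), tolerance = 1e-6)
all.equal(dnorm(0), 1 / sqrt(2 * pi))</code></pre>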
<p>The empirical CATE and CATT agree very well with the theoretical value:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb5-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(T)  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># empirical CATE</span></span></code></pre></div></div>
<pre><code>## [1] 0.4012137</code></pre>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb7-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(T[W])  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># empirical CATT</span></span></code></pre></div></div>
<pre><code>## [1] 0.3995248</code></pre>
<p>Let’s see now how well the causal models recover the treatment effects. First the causal forest:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb9-1">c.forest <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">causal_forest</span>(X, Y, W)</span>
<span id="cb9-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">average_treatment_effect</span>(c.forest, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">target.sample =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'all'</span>)</span></code></pre></div></div>
<pre><code>##   estimate    std.err 
## 0.39445852 0.01440154</code></pre>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb11-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">average_treatment_effect</span>(c.forest, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">target.sample =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'treated'</span>)</span></code></pre></div></div>
<pre><code>##   estimate    std.err 
## 0.39432334 0.01440661</code></pre>
<p>Pretty good. Next the Double Machine Learning approach:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb13" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb13-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">invisible</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">dmlmt</span>(X, W, Y))</span></code></pre></div></div>
<pre><code>## 
##  Binary treatment
## 
## 
## 
##  Potential outcomes:
##                     PO     SE
## Treatment 0 -0.3999623 0.0131
## Treatment 1 -0.0077989 0.0139
## 
## Average effects
##               TE       SE      t         p    
## T1 - T0 0.392163 0.015554 25.214 &lt; 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## # of obs on / off support: 19997  /  3</code></pre>
<p>Also very good. Let’s see now how a traditional regressor performs. There are two approaches. In the first one, we train a single model&nbsp;<img src="https://latex.codecogs.com/png.latex?%5Chat%7Bf%7D()">&nbsp;on the entire dataset such that&nbsp;<img src="https://latex.codecogs.com/png.latex?%5Chat%7Bf%7D(X,%20W)">&nbsp;estimates the outcome for covariates&nbsp;<img src="https://latex.codecogs.com/png.latex?X">&nbsp;given treatment assignment&nbsp;<img src="https://latex.codecogs.com/png.latex?W">; the CATE is then estimated by averaging&nbsp;<img src="https://latex.codecogs.com/png.latex?%5Chat%7Bf%7D(X,%201)%20-%20%5Chat%7Bf%7D(X,%200)">&nbsp;over all units.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb15" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb15-1">ranger.model <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ranger</span>(Y <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> ., <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(X, W, Y))</span></code></pre></div></div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb16" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb16-1">ranger_cate <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(ranger.model, X) {</span>
<span id="cb16-2">  data_untreated <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(X, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">W =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span>
<span id="cb16-3">  data_treated <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(X, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">W =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb16-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">predict</span>(ranger.model, data_treated)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>predictions <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">predict</span>(ranger.model, data_untreated)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>predictions)</span>
<span id="cb16-5">}</span>
<span id="cb16-6"></span>
<span id="cb16-7"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ranger_cate</span>(ranger.model, X)</span></code></pre></div></div>
<pre><code>## [1] 0.3480178</code></pre>
<p>Really bad. Let’s see if another approach might work better. In the second approach, we train two models:&nbsp;<img src="https://latex.codecogs.com/png.latex?f_%7B(1)%7D(X%5BW%5D)">&nbsp;on the treated units,&nbsp;<img src="https://latex.codecogs.com/png.latex?f_%7B(0)%7D(X%5B%5Cneg%20W%5D)">&nbsp;on the untreated units. The CATE is then estimated by averaging&nbsp;<img src="https://latex.codecogs.com/png.latex?f_%7B(1)%7D(X)%20-%20f_%7B(0)%7D(X)">&nbsp;over all units:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb18" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb18-1">model_treated <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ranger</span>(Y <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> ., <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(X, W, Y)[W, ])</span>
<span id="cb18-2">model_untreated <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ranger</span>(Y <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> ., <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(X, W, Y)[<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span>W, ])</span>
<span id="cb18-3">ranger_cate_two_models <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(model_treated, model_untreated, X) {</span>
<span id="cb18-4">  data_untreated <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(X, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">W =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span>
<span id="cb18-5">  data_treated <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(X, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">W =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb18-6">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">predict</span>(model_treated, data_treated)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>predictions <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">predict</span>(model_untreated, data_untreated)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>predictions) </span>
<span id="cb18-7">}</span>
<span id="cb18-8"></span>
<span id="cb18-9"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ranger_cate_two_models</span>(model_treated, model_untreated, X)</span></code></pre></div></div>
<pre><code>## [1] 0.3945442</code></pre>
<p>Much better. And finally a simple linear regression:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb20" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb20-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lm</span>(Y <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> ., <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(X, W, Y))</span></code></pre></div></div>
<pre><code>## 
## Call:
## lm(formula = Y ~ ., data = data.frame(X, W, Y))
## 
## Coefficients:
## (Intercept)           X1           X2           X3           X4           X5  
##   -0.396914     0.252459     0.991225     0.501298     0.004950     0.002065  
##          X6           X7           X8           X9          X10        WTRUE  
##   -0.002804    -0.011666     0.004563     0.010666     0.007237     0.392153</code></pre>
<p>This is also a very good estimate, which is what you’d expect from what is essentially a randomized trial.</p>
<p>So on this first dataset, which has no confounding, the “proper” causal models beat a traditional regressor unless we train two separate regressors, one on the treated units and one on the untreated units. Let’s now look at the other examples.</p>
</section>
<section id="case-2-example-with-confounding-from-causal_forests-home-page" class="level3">
<h3 class="anchored" data-anchor-id="case-2-example-with-confounding-from-causal_forests-home-page">Case 2: example with confounding from&nbsp;<code>causal_forest</code>’s home page</h3>
<p>The <a href="https://github.com/grf-labs/grf">home page</a> for the <code>grf</code> package has an example that differs slightly from the one above, in that the treatment assignment <img src="https://latex.codecogs.com/png.latex?W"> depends on <img src="https://latex.codecogs.com/png.latex?X_1">:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb22" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb22-1">X <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">matrix</span>(</span>
<span id="cb22-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rnorm</span>(N <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> P),</span>
<span id="cb22-3">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">nrow =</span> N,</span>
<span id="cb22-4">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">ncol =</span> P,</span>
<span id="cb22-5">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">dimnames =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">list</span>(<span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NULL</span>, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">paste0</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'X'</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>P))</span>
<span id="cb22-6">)</span>
<span id="cb22-7">W <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rbernoulli</span>(N, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.4</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.2</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> (X[, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>] <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>))</span>
<span id="cb22-8">T <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">pmax</span>(X[, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>], <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span>
<span id="cb22-9">E <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> X[, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>] <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">pmin</span>(X[, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>], <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span>
<span id="cb22-10">Y <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> T <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> W <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> E <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rnorm</span>(N)</span></code></pre></div></div>
<p>The theoretical CATE is the same as above, but the theoretical CATT is slightly higher, since units with a positive <code>X[, 1]</code> are more likely to be treated and are also the only ones with a non-zero treatment effect:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb23" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb23-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Theoretical CATE</span></span>
<span id="cb23-2"><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sqrt</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>pi)</span></code></pre></div></div>
<pre><code>## [1] 0.3989423</code></pre>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb25" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb25-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Empirical CATE</span></span>
<span id="cb25-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(T)</span></code></pre></div></div>
<pre><code>## [1] 0.3933349</code></pre>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb27" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb27-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Theoretical CATT</span></span>
<span id="cb27-2"><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sqrt</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>pi) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span></span></code></pre></div></div>
<pre><code>## [1] 0.4787307</code></pre>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb29" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb29-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Empirical CATT</span></span>
<span id="cb29-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(T[W])</span></code></pre></div></div>
<pre><code>## [1] 0.4700207</code></pre>
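<p>The factor of 6/5 can be checked numerically. The sketch below is not part of the original analysis. It assumes, as in the simulation above, that <code>X[, 1]</code> is standard normal, that the probability of treatment is 0.6 when <code>X[, 1] &gt; 0</code> and 0.4 otherwise, and that the treatment effect is <code>pmax(X[, 1], 0)</code>. The variable names are illustrative only:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># CATT = E[tau(X1) * e(X1)] / E[e(X1)], with X1 ~ N(0, 1)
# tau(x) = pmax(x, 0) vanishes for x &lt;= 0, so only x &gt; 0 (where e = 0.6) contributes
numerator &lt;- integrate(function(x) x * 0.6 * dnorm(x), lower = 0, upper = Inf)$value

# E[e(X1)] = P(W = 1) = 0.4 * P(X1 &lt;= 0) + 0.6 * P(X1 &gt; 0)
p_treated &lt;- 0.4 * pnorm(0) + 0.6 * (1 - pnorm(0))

catt_numeric &lt;- numerator / p_treated
catt_closed_form &lt;- 1 / sqrt(2 * pi) * 6 / 5

# TRUE if the numerical integral agrees with the closed form
isTRUE(all.equal(catt_numeric, catt_closed_form, tolerance = 1e-4))</code></pre></div>
<p>The numerator works out to 0.6 / sqrt(2π) and the denominator to 0.5, which is where the 6/5 comes from.</p>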
<p>As above, let’s run the four models. First the causal forest:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb31" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb31-1">c.forest <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">causal_forest</span>(X, Y, W)</span>
<span id="cb31-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">average_treatment_effect</span>(c.forest, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">target.sample =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'all'</span>)</span></code></pre></div></div>
<pre><code>##   estimate    std.err 
## 0.38286619 0.01456181</code></pre>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb33" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb33-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">average_treatment_effect</span>(c.forest, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">target.sample =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'treated'</span>)</span></code></pre></div></div>
<pre><code>##   estimate    std.err 
## 0.45825280 0.01486095</code></pre>
<p>Excellent agreement with the ground truth. Next the&nbsp;<code>dmlmt</code>:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb35" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb35-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">invisible</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">dmlmt</span>(X, W, Y))</span></code></pre></div></div>
<pre><code>## 
##  Binary treatment
## 
## 
## 
##  Potential outcomes:
##                    PO     SE
## Treatment 0 -0.400063 0.0132
## Treatment 1 -0.012876 0.0139
## 
## Average effects
##               TE       SE      t         p    
## T1 - T0 0.387187 0.015584 24.846 &lt; 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## # of obs on / off support: 19998  /  2</code></pre>
<p>Also very good agreement; note, however, that <code>dmlmt</code> can only estimate the CATE, not the CATT. Next, the one-model regressor:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb37" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb37-1">ranger.model <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ranger</span>(Y <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> ., <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(X, W, Y))</span>
<span id="cb37-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ranger_cate</span>(ranger.model, X)</span></code></pre></div></div>
<pre><code>## [1] 0.3420549</code></pre>
<p>Just as in the previous case, this is not very good. Next, the two-model regressors:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb39" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb39-1">model_treated <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ranger</span>(Y <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> ., <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(X, W, Y)[W, ])</span>
<span id="cb39-2">model_untreated <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ranger</span>(Y <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> ., <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(X, W, Y)[<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span>W, ])</span>
<span id="cb39-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ranger_cate_two_models</span>(model_treated, model_untreated, X)</span></code></pre></div></div>
<pre><code>## [1] 0.3885038</code></pre>
<p>And the linear regression:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb41" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb41-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lm</span>(Y <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> ., <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(X, W, Y))</span></code></pre></div></div>
<pre><code>## 
## Call:
## lm(formula = Y ~ ., data = data.frame(X, W, Y))
## 
## Coefficients:
## (Intercept)           X1           X2           X3           X4           X5  
##  -0.3553287    0.2582300    0.9965371    0.5002037    0.0002953   -0.0066890  
##          X6           X7           X8           X9          X10        WTRUE  
##   0.0058448    0.0001589   -0.0039690   -0.0037644    0.0059526    0.3778513</code></pre>
<p>This, again, is not too bad, especially compared with the one-model regressor.</p>
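<p>Printing the whole model makes it hard to compare the treatment coefficient with the standard errors reported by the causal models. Here is a minimal sketch of how to pull out just that coefficient with its uncertainty. It re-simulates the Case 2 data with base R only; the values <code>N = 20000</code> and <code>P = 10</code> are inferred from the outputs above, and the seed and the use of <code>rbinom()</code> in place of <code>rbernoulli()</code> are my own choices:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">set.seed(1)  # arbitrary seed, only to make this sketch reproducible
N &lt;- 20000   # assumed sample size
P &lt;- 10      # assumed number of covariates
X &lt;- matrix(rnorm(N * P), nrow = N, ncol = P,
            dimnames = list(NULL, paste0('X', 1:P)))
# Base-R stand-in for rbernoulli(): a logical treatment indicator
W &lt;- rbinom(N, size = 1, prob = 0.4 + 0.2 * (X[, 1] &gt; 0)) == 1
Y &lt;- pmax(X[, 1], 0) * W + X[, 2] + pmin(X[, 3], 0) + rnorm(N)

fit &lt;- lm(Y ~ ., data.frame(X, W, Y))

# Point estimate and standard error of the treatment coefficient
coef(summary(fit))['WTRUE', c('Estimate', 'Std. Error')]

# 95% confidence interval for the same coefficient
confint(fit, 'WTRUE')</code></pre></div>
<p>Because <code>W</code> is logical, <code>lm</code> names its coefficient <code>WTRUE</code>, as in the output above.</p>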
<p>The previous two examples had relatively simple treatment and main effects. How well do these models perform in more complex situations? To see this I’m going to run them through some of the stress-tests given in section 6 of&nbsp;<a href="https://arxiv.org/abs/1712.04912">Nie and Wager (2020)</a>.</p>
</section>
<section id="case-3-no-confounding-non-trivial-main-effects" class="level3">
<h3 class="anchored" data-anchor-id="case-3-no-confounding-non-trivial-main-effects">Case 3: no confounding, non-trivial main effects</h3>
<p>In this case the treatment assignment is random and we’re essentially running a randomized trial, but with complex main effects:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb43" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb43-1">X <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">matrix</span>(</span>
<span id="cb43-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rnorm</span>(N <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> P),</span>
<span id="cb43-3">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">nrow =</span> N,</span>
<span id="cb43-4">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">ncol =</span> P,</span>
<span id="cb43-5">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">dimnames =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">list</span>(<span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NULL</span>, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">paste0</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'X'</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>P))</span>
<span id="cb43-6">)</span>
<span id="cb43-7">W <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rbernoulli</span>(N, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">p =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>)</span>
<span id="cb43-8">T <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> X[, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>] <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">log</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">exp</span>(X[, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>]))</span>
<span id="cb43-9">E <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">pmax</span>(X[, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>] <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> X[, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>], X[, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>], <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">pmax</span>(X[, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>] <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> X[, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>], <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span>
<span id="cb43-10">Y <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> T <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> W <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> E <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rnorm</span>(N)</span></code></pre></div></div>
<p>The empirical CATE and CATT are very close, since there’s no confounding:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb44" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb44-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(T)</span></code></pre></div></div>
<pre><code>## [1] 0.804748</code></pre>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb46" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb46-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(T[W])</span></code></pre></div></div>
<pre><code>## [1] 0.8047635</code></pre>
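<p>Unlike Case 2, no theoretical value is given here, but one can be computed numerically. The following sketch is my own addition and assumes, as in the simulation above, that the covariates are independent standard normals. Since <code>X[, 1]</code> has mean zero, the theoretical CATE reduces to the expectation of <code>log(1 + exp(X[, 2]))</code>:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Numerically stable log(1 + exp(x)) that does not overflow for large x
softplus &lt;- function(x) pmax(x, 0) + log1p(exp(-abs(x)))

# E[X1] = 0, so the CATE is E[softplus(X2)] with X2 ~ N(0, 1)
theoretical_cate &lt;- integrate(
  function(x) softplus(x) * dnorm(x),
  lower = -Inf, upper = Inf
)$value

theoretical_cate</code></pre></div>
<p>The result should land close to the two empirical values above. Because the treatment assignment is independent of the covariates, the theoretical CATT is the same number.</p>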
<p>Here are the estimates using a causal forest:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb48" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb48-1">c.forest <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">causal_forest</span>(X, Y, W)</span>
<span id="cb48-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">average_treatment_effect</span>(c.forest, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">target.sample =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'all'</span>)</span></code></pre></div></div>
<pre><code>##   estimate    std.err 
## 0.80974803 0.01518978</code></pre>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb50" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb50-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">average_treatment_effect</span>(c.forest, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">target.sample =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'treated'</span>)</span></code></pre></div></div>
<pre><code>##   estimate    std.err 
## 0.80902777 0.01520083</code></pre>
<p>Next the&nbsp;<code>dmlmt</code>:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb52" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb52-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">invisible</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">dmlmt</span>(X, W, Y))</span></code></pre></div></div>
<pre><code>## 
##  Binary treatment
## 
## 
## 
##  Potential outcomes:
##                 PO     SE
## Treatment 0 1.3958 0.0139
## Treatment 1 2.2089 0.0179
## 
## Average effects
##              TE      SE     t         p    
## T1 - T0 0.81307 0.01876 43.34 &lt; 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## # of obs on / off support: 19998  /  2</code></pre>
<p>Next the one-model regressor:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb54" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb54-1">ranger.model <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ranger</span>(Y <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> ., <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(X, W, Y))</span>
<span id="cb54-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ranger_cate</span>(ranger.model, X)</span></code></pre></div></div>
<pre><code>## [1] 0.7359536</code></pre>
<p>Next the two-model regressors:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb56" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb56-1">model_treated <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ranger</span>(Y <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> ., <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(X, W, Y)[W, ])</span>
<span id="cb56-2">model_untreated <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ranger</span>(Y <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> ., <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(X, W, Y)[<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span>W, ])</span>
<span id="cb56-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ranger_cate_two_models</span>(model_treated, model_untreated, X)</span></code></pre></div></div>
<pre><code>## [1] 0.8168369</code></pre>
<p>And finally the linear regression:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb58" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb58-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lm</span>(Y <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> ., <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(X, W, Y))</span></code></pre></div></div>
<pre><code>## 
## Call:
## lm(formula = Y ~ ., data = data.frame(X, W, Y))
## 
## Coefficients:
## (Intercept)           X1           X2           X3           X4           X5  
##    1.396381     0.911487     0.674177     0.356589     0.494779     0.513076  
##          X6           X7           X8           X9          X10        WTRUE  
##   -0.001450    -0.003947     0.006237     0.001400     0.010172     0.812840</code></pre>
<p>All models, except the one-model random forest, perform rather well on this dataset with no confounders.</p>
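<p>To put numbers on “rather well”, the estimates reported above can be compared with the empirical CATE. This is just a small sketch using values copied from the outputs in this section; the vector names are mine:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Case 3 estimates, copied from the outputs above
estimates &lt;- c(
  causal_forest    = 0.80974803,
  dmlmt            = 0.81307,
  ranger_one_model = 0.7359536,
  ranger_two_model = 0.8168369,
  lm               = 0.812840
)
empirical_cate &lt;- 0.804748  # mean(T) reported above

# Relative error of each estimate in percent, smallest absolute error first
rel_error &lt;- 100 * (estimates - empirical_cate) / empirical_cate
round(rel_error[order(abs(rel_error))], 1)</code></pre></div>
<p>The one-model random forest underestimates the effect by roughly 8.5%, while every other model is within about 1.5% of the empirical CATE.</p>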
</section>
<section id="case-4-confounding-non-trivial-main-effects" class="level3">
<h3 class="anchored" data-anchor-id="case-4-confounding-non-trivial-main-effects">Case 4: confounding, non-trivial main effects</h3>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb60" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb60-1">X <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">matrix</span>(</span>
<span id="cb60-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">runif</span>(N <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> P),  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># note the uniform distribution</span></span>
<span id="cb60-3">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">nrow =</span> N,</span>
<span id="cb60-4">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">ncol =</span> P,</span>
<span id="cb60-5">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">dimnames =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">list</span>(<span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NULL</span>, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">paste0</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'X'</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>P))</span>
<span id="cb60-6">)</span>
<span id="cb60-7">trim <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(x, eta) <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">pmax</span>(eta, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">pmin</span>(x, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> eta))</span>
<span id="cb60-8">W <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rbernoulli</span>(N, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">trim</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sinpi</span>(X[, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>] <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> X[, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>]), <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.1</span>))</span>
<span id="cb60-9">T <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> (X[, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>] <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> X[, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>]) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span></span>
<span id="cb60-10">E <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sinpi</span>(X[, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>] <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> X[, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>]) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> (X[, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>] <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">^</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> X[, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>] <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> X[, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>]</span>
<span id="cb60-11">Y <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> T <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> W <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> E <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rnorm</span>(N)</span></code></pre></div></div>
<p>Here are the empirical CATE and CATT:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb61" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb61-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(T)</span></code></pre></div></div>
<pre><code>## [1] 0.5006082</code></pre>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb63" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb63-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(T[W])</span></code></pre></div></div>
<pre><code>## [1] 0.5969957</code></pre>
<p>Here are the estimates using a causal forest:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb65" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb65-1">c.forest <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">causal_forest</span>(X, Y, W)</span>
<span id="cb65-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">average_treatment_effect</span>(c.forest, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">target.sample =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'all'</span>)</span></code></pre></div></div>
<pre><code>##   estimate    std.err 
## 0.53728132 0.01858506</code></pre>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb67" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb67-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">average_treatment_effect</span>(c.forest, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">target.sample =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'treated'</span>)</span></code></pre></div></div>
<pre><code>##   estimate    std.err 
## 0.63199708 0.02226546</code></pre>
<p>Not too bad; perhaps a bit biased on the CATT estimate. Next the&nbsp;<code>dmlmt</code>:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb69" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb69-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">invisible</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">dmlmt</span>(X, W, Y))</span></code></pre></div></div>
<pre><code>## 
##  Binary treatment
## 
## 
## 
##  Potential outcomes:
##                 PO     SE
## Treatment 0 1.2626 0.0182
## Treatment 1 1.9945 0.0130
## 
## Average effects
##               TE       SE      t         p    
## T1 - T0 0.731896 0.022071 33.161 &lt; 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## # of obs on / off support: 19958  /  42</code></pre>
<p>Here the estimate is far too high (0.73 against an empirical CATE of 0.50). Let’s now look at the one-model regressor:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb71" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb71-1">ranger.model <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ranger</span>(Y <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> ., <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(X, W, Y))</span>
<span id="cb71-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ranger_cate</span>(ranger.model, X)</span></code></pre></div></div>
<pre><code>## [1] 0.5871552</code></pre>
<p>Next the two-model regressors:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb73" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb73-1">model_treated <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ranger</span>(Y <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> ., <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(X, W, Y)[W, ])</span>
<span id="cb73-2">model_untreated <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ranger</span>(Y <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> ., <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(X, W, Y)[<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span>W, ])</span>
<span id="cb73-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ranger_cate_two_models</span>(model_treated, model_untreated, X)</span></code></pre></div></div>
<pre><code>## [1] 0.6317095</code></pre>
<p>And finally the linear regression:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb75" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb75-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lm</span>(Y <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> ., <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(X, W, Y))</span></code></pre></div></div>
<pre><code>## 
## Call:
## lm(formula = Y ~ ., data = data.frame(X, W, Y))
## 
## Coefficients:
## (Intercept)           X1           X2           X3           X4           X5  
##   -0.193808     0.805531     0.822270    -0.053563     1.031979     0.512983  
##          X6           X7           X8           X9          X10        WTRUE  
##   -0.004783    -0.041369     0.019418     0.044691     0.010070     0.705682</code></pre>
<p>Every model overestimates the treatment effect here, but the causal forest comes closest (0.54 against an empirical CATE of 0.50).</p>
<p>In the final example, we have complex confounding and non-trivial main effects, but a trivial treatment effect.</p>
</section>
<section id="case-5-confounding-trivial-treatment-effect-non-trivial-main-effects" class="level3">
<h3 class="anchored" data-anchor-id="case-5-confounding-trivial-treatment-effect-non-trivial-main-effects">Case 5: confounding, trivial treatment effect, non-trivial main effects</h3>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb77" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb77-1">X <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">matrix</span>(</span>
<span id="cb77-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rnorm</span>(N <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> P),</span>
<span id="cb77-3">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">nrow =</span> N,</span>
<span id="cb77-4">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">ncol =</span> P,</span>
<span id="cb77-5">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">dimnames =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">list</span>(<span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NULL</span>, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">paste0</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'X'</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>P))</span>
<span id="cb77-6">)</span>
<span id="cb77-7">W <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rbernoulli</span>(N, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> (<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">exp</span>(X[, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>] <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> X[, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>])))</span>
<span id="cb77-8">T <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb77-9">E <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">log</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">exp</span>(X[, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>] <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> X[, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>] <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> X[, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>]))</span>
<span id="cb77-10">Y <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> T <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> W <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> E <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rnorm</span>(N)</span></code></pre></div></div>
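<p>Because <code>W</code> depends on <code>X2</code> and <code>X3</code>, which also drive the main effect <code>E</code>, a naive comparison of treated and untreated outcomes is confounded. Here is a minimal sketch of that naive estimate on a separate simulated sample, assuming 20000 observations and 10 covariates (illustrative values, not taken from the setup above) and using base R’s <code>rbinom</code> in place of <code>rbernoulli</code>; the <code>_demo</code> names are illustrative and kept separate from the variables above:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">set.seed(42)                      # arbitrary seed, for reproducibility only
n_demo &lt;- 20000                   # assumed sample size
p_demo &lt;- 10                      # assumed number of covariates
X_demo &lt;- matrix(rnorm(n_demo * p_demo), nrow = n_demo, ncol = p_demo)
# Same propensity as above, drawn with base R's rbinom()
propensity &lt;- 1 / (1 + exp(X_demo[, 2] + X_demo[, 3]))
W_demo &lt;- rbinom(n_demo, 1, propensity) == 1
tau &lt;- 1                          # constant treatment effect
E_demo &lt;- 2 * log(1 + exp(X_demo[, 1] + X_demo[, 2] + X_demo[, 3]))
Y_demo &lt;- tau * W_demo + E_demo + rnorm(n_demo)
# Naive estimate: difference in mean outcome between treated and untreated
naive &lt;- mean(Y_demo[W_demo]) - mean(Y_demo[!W_demo])
naive</code></pre></div>
<p>Units with a large <code>X2 + X3</code> are both less likely to be treated and have a larger main effect, so the naive difference should land well below the true effect of 1. Any estimator has to undo this.</p>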
<p>In this case we have CATE = CATT = 1. Let’s see how the causal forest performs:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb78" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb78-1">c.forest <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">causal_forest</span>(X, Y, W)</span>
<span id="cb78-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">average_treatment_effect</span>(c.forest, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">target.sample =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'all'</span>)</span></code></pre></div></div>
<pre><code>##   estimate    std.err 
## 0.94210993 0.01783883</code></pre>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb80" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb80-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">average_treatment_effect</span>(c.forest, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">target.sample =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'treated'</span>)</span></code></pre></div></div>
<pre><code>##   estimate    std.err 
## 0.94966382 0.02062717</code></pre>
<p>Pretty good. Next the&nbsp;<code>dmlmt</code>:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb82" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb82-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">invisible</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">dmlmt</span>(X, W, Y))</span></code></pre></div></div>
<pre><code>## 
##  Binary treatment
## 
## 
## 
##  Potential outcomes:
##                 PO     SE
## Treatment 0 1.9719 0.0254
## Treatment 1 2.9588 0.0273
## 
## Average effects
##               TE       SE      t         p    
## T1 - T0 0.986893 0.032839 30.053 &lt; 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## # of obs on / off support: 19855  /  145</code></pre>
<p>Very good estimates too. How about the one-model regressor?</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb84" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb84-1">ranger.model <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ranger</span>(Y <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> ., <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(X, W, Y))</span>
<span id="cb84-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ranger_cate</span>(ranger.model, X)</span></code></pre></div></div>
<pre><code>## [1] 0.613593</code></pre>
<p>Oops, not that good. Perhaps the two-model regressor does better?</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb86" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb86-1">model_treated <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ranger</span>(Y <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> ., <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(X, W, Y)[W, ])</span>
<span id="cb86-2">model_untreated <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ranger</span>(Y <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> ., <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(X, W, Y)[<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span>W, ])</span>
<span id="cb86-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ranger_cate_two_models</span>(model_treated, model_untreated, X)</span></code></pre></div></div>
<pre><code>## [1] 0.7121187</code></pre>
<p>Slightly better, but still a disaster. And finally the linear regression:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb88" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb88-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lm</span>(Y <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> ., <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(X, W, Y))</span></code></pre></div></div>
<pre><code>## 
## Call:
## lm(formula = Y ~ ., data = data.frame(X, W, Y))
## 
## Coefficients:
## (Intercept)           X1           X2           X3           X4           X5  
##   1.9846267    1.0057652    0.9917930    1.0018414    0.0025015   -0.0173148  
##          X6           X7           X8           X9          X10        WTRUE  
##  -0.0007228   -0.0001395    0.0019996   -0.0055046   -0.0025001    0.9904264</code></pre>
<p>Much better.</p>
</section>
<section id="conclusion" class="level3">
<h3 class="anchored" data-anchor-id="conclusion">Conclusion</h3>
<p>The following table summarizes the empirical CATE in each case, and the CATE estimated by each algorithm:</p>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Case</th>
<th>CATE</th>
<th>GRF</th>
<th>DMLMT</th>
<th>RF1</th>
<th>RF2</th>
<th>LR</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>1</td>
<td>0.40</td>
<td>0.39</td>
<td>0.39</td>
<td>0.35</td>
<td>0.39</td>
<td>0.39</td>
</tr>
<tr class="even">
<td>2</td>
<td>0.39</td>
<td>0.38</td>
<td>0.39</td>
<td>0.34</td>
<td>0.39</td>
<td>0.38</td>
</tr>
<tr class="odd">
<td>3</td>
<td>0.80</td>
<td>0.81</td>
<td>0.81</td>
<td>0.74</td>
<td>0.82</td>
<td>0.81</td>
</tr>
<tr class="even">
<td>4</td>
<td>0.50</td>
<td>0.54</td>
<td>0.73</td>
<td>0.59</td>
<td>0.63</td>
<td>0.71</td>
</tr>
<tr class="odd">
<td>5</td>
<td>1.00</td>
<td>0.94</td>
<td>0.99</td>
<td>0.61</td>
<td>0.71</td>
<td>0.99</td>
</tr>
</tbody>
</table>
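<p>One way to read the table is to compute each method’s absolute error against the empirical CATE. A minimal sketch in base R, with the numbers copied from the table (the variable names are illustrative):</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">results &lt;- data.frame(
  case  = 1:5,
  CATE  = c(0.40, 0.39, 0.80, 0.50, 1.00),
  GRF   = c(0.39, 0.38, 0.81, 0.54, 0.94),
  DMLMT = c(0.39, 0.39, 0.81, 0.73, 0.99),
  RF1   = c(0.35, 0.34, 0.74, 0.59, 0.61),
  RF2   = c(0.39, 0.39, 0.82, 0.63, 0.71),
  LR    = c(0.39, 0.38, 0.81, 0.71, 0.99)
)
methods &lt;- c("GRF", "DMLMT", "RF1", "RF2", "LR")
# One column of absolute errors per method, one row per case
abs_error &lt;- sapply(results[methods], function(est) abs(est - results$CATE))
worst_error &lt;- apply(abs_error, 2, max)   # largest miss across the five cases
mean_error &lt;- colMeans(abs_error)         # average miss across the five cases
round(rbind(worst_error, mean_error), 3)</code></pre></div>
<p>The worst-case row is the more telling one: a method that is excellent in four cases but far off in the fifth is of little use when you cannot tell which case you are in.</p>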
<p>The causal forest from the <code>grf</code> package is the most consistently accurate method: it is never far from the truth, whereas every other method misses badly in at least one case. Like I said at the beginning, I’m not entirely sure why, but this quick study should alert you to the fact that <strong>causal studies are hard</strong>, because unlike traditional regression problems, here <strong>you have no ground truth</strong> against which to cross-validate your model. Your regression runs just fine and you can p-value your results all the way to statistical hell, but the fact remains that you just don’t know what you are doing.</p>
<p>I would love to hear from anyone who could explain in simple terms why we see such a variety of estimation accuracies.</p>


</section>

 ]]></description>
  <category>r</category>
  <guid>https://blog.davidlindelof.com/posts/2021/02/no-you-have-not-controlled-for-confounders/</guid>
  <pubDate>Wed, 10 Feb 2021 00:00:00 GMT</pubDate>
</item>
<item>
  <title>A/B testing my resume</title>
  <link>https://blog.davidlindelof.com/posts/2020/11/a-b-testing-my-resume/</link>
  <description><![CDATA[ 






<p>Internet wisdom is divided on whether one-page resumes are more effective at landing you an interview than two-page ones. Most of the advice out there seems to be based on opinion or anecdote, with very little scientific basis.</p>
<p>Well, let’s fix that.</p>
<p>Being currently open to work, I thought this would be the right time to test this scientifically. I have two versions of my resume:</p>
<ul>
<li><a href="https://davidlindelof.com/wp-content/uploads/2020/11/Lindelof_CV.pdf">A two-page, employment + education on first page, extra information on the second page</a>&nbsp;such as online courses, hobbies etc.</li>
<li><a href="https://davidlindelof.com/wp-content/uploads/2020/11/Lindelof-Resume-Dec-20.pdf">A one-page, dense, responsibilities + achievements only</a>, follows template from the&nbsp;<a href="https://www.manager-tools.com/products/resume-workbook">Career Tools resume workbook</a>.</li>
</ul>
<p>The purpose of a resume is to land you an interview, so we’ll track for each resume how many applications yield a call for an interview. Non-responses after one week are treated as failures. We’ll model the number of interviews each resume yields as binomially distributed: all other things being equal, we’ll assume all applications using the same resume type have the same probability (<img src="https://latex.codecogs.com/png.latex?p1">&nbsp;or&nbsp;<img src="https://latex.codecogs.com/png.latex?p2">) of landing an interview. We’d like to estimate these probabilities and decide whether one resume is more effective than the other.</p>
<!--more-->
<p>In a traditional randomized trial, we would randomly assign each job offer to a resume and record the success rate. But let’s estimate the&nbsp;<a href="https://en.wikipedia.org/wiki/Power_of_a_test">statistical power</a>&nbsp;of such a test. From past experience, and also from many plots such as <a href="https://www.reddit.com/r/ProductManagement/comments/j3654d/8_weeks_of_job_search_spree_ended_happily_two/">this one</a>&nbsp;posted on Reddit, it seems reasonable to assign a baseline success rate of about 0.1 (i.e., about one application in 10 yields an interview). Suppose the one-page version is twice as effective and we apply to 100 positions with each. Then the statistical power, i.e.&nbsp;the probability of detecting a statistically significant effect, is given by:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(Exact)</span>
<span id="cb1-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">power.exact.test</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">p1 =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.2</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">p2 =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.1</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">n1 =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">n2 =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>)</span></code></pre></div></div>
<pre><code>## 
##      Z-pooled Exact Test 
## 
##          n1, n2 = 100, 100
##          p1, p2 = 0.2, 0.1
##           alpha = 0.05
##           power = 0.501577
##     alternative = two.sided
##           delta = 0</code></pre>
<p>That is, we have only about a 50% chance of detecting the effect at the 0.05 significance level. This is not going to work; at a rate of about 10 applications per month, the 200 applications would take 20 months.</p>
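<p>To get a feel for how many applications a conventional trial would need for a more comfortable 80% power, here is a rough sketch using base R’s <code>power.prop.test</code>. Note that it relies on a normal approximation rather than the exact test used above, so the numbers are only indicative; the 80% target and the variable names are assumptions made for illustration:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Normal-approximation sample size for 80% power (not the exact test used above)
needed &lt;- power.prop.test(p1 = 0.2, p2 = 0.1, power = 0.8, sig.level = 0.05)
n_per_resume &lt;- ceiling(needed$n)   # applications needed with each resume
months &lt;- 2 * n_per_resume / 10     # both resumes, at about 10 applications per month
c(n_per_resume = n_per_resume, months = months)</code></pre></div>
<p>Whatever the exact figure, it can only be larger than the 100 applications per resume considered above, which already gave a power of only about one half.</p>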
<p>Instead I’m going to frame this as a&nbsp;<a href="https://en.wikipedia.org/wiki/Multi-armed_bandit">multi-armed bandit</a>&nbsp;problem: I have two resumes and I don’t know which one is more effective, so I’d like to test them both&nbsp;<em>but</em>&nbsp;give preference to the one that seems to have the higher rate of success, a strategy known as trading off exploration against exploitation.</p>
<p>We’ll begin by assuming again that each resume has about a 10% chance of success, but since this is based on limited experience, it makes sense to treat this probability as the expected value of a beta distribution parameterized by, say, 1 success and 9 failures.</p>
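<p>As a quick check on that prior, here is a minimal sketch in base R: a beta distribution with 1 success and 9 failures has mean 1 / (1 + 9), and its quantiles show how much uncertainty it still leaves. The variable names are illustrative:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">a &lt;- 1  # prior successes
b &lt;- 9  # prior failures
prior_mean &lt;- a / (a + b)                        # expected value of Beta(a, b)
prior_interval &lt;- qbeta(c(0.025, 0.975), a, b)   # central 95% prior interval
prior_mean
prior_interval
# Conjugate update after one more application with the same resume:
posterior_after_success &lt;- c(a + 1, b)   # one more success
posterior_after_failure &lt;- c(a, b + 1)   # one more failure</code></pre></div>
<p>The update rule is what makes the beta distribution convenient here: each new outcome simply increments one of the two parameters.</p>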
<p>So whenever we apply for a new job, we:</p>
<ul>
<li>draw a new&nbsp;<img src="https://latex.codecogs.com/png.latex?p1">&nbsp;and&nbsp;<img src="https://latex.codecogs.com/png.latex?p2">&nbsp;from their respective beta distributions</li>
<li>apply using the resume with the higher drawn probability</li>
<li>update the selected resume’s beta distribution according to its success or failure.</li>
</ul>
<p>Let’s simulate this, assuming that we know immediately if the application was successful or not. Let’s take the “true” probabilities to be 0.14 and 0.11 for the one-page and two-page resumes respectively. We’ll keep track of the simulation state in a simple list:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb3-1">new_stepper <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>() {</span>
<span id="cb3-2">  state <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">list</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">k1 =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">n1 =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">p1 =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.14</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">k2 =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">n2 =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">p2 =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.11</span>)</span>
<span id="cb3-3">  step <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>() {</span>
<span id="cb3-4">    old_state <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> state</span>
<span id="cb3-5">    state <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">next_state</span>(state)</span>
<span id="cb3-6">    old_state</span>
<span id="cb3-7">  }</span>
<span id="cb3-8">  step</span>
<span id="cb3-9">}</span></code></pre></div></div>
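<p>The stepper relies on R closures and the <code>&lt;&lt;-</code> super-assignment operator. The same pattern can be seen in isolation with a hypothetical counter, which is not part of the simulation:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">new_counter &lt;- function() {
  count &lt;- 0
  function() {
    old_count &lt;- count
    count &lt;&lt;- count + 1   # modifies count in the enclosing environment
    old_count             # like step(), return the value before the update
  }
}
counter &lt;- new_counter()
first &lt;- counter()
second &lt;- counter()</code></pre></div>
<p>Each call returns the previous value and advances the hidden state, which is exactly what <code>step()</code> does with the simulation state.</p>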
<p><code>new_stepper()</code>&nbsp;returns a closure that keeps a reference to the simulation state. Each call to that closure updates the state using the&nbsp;<code>next_state</code>&nbsp;function:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb4-1">next_state <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(state) {</span>
<span id="cb4-2">  p1 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rbeta</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, state<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>k1, state<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>n1 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> state<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>k1)</span>
<span id="cb4-3">  p2 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rbeta</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, state<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>k2, state<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>n2 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> state<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>k2)</span>
<span id="cb4-4">  pull1 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> p1 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> p2</span>
<span id="cb4-5">  result <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rbinom</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ifelse</span>(pull1, state<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>p1, state<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>p2))</span>
<span id="cb4-6">  <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> (pull1) {</span>
<span id="cb4-7">    state<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>n1 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> state<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>n1 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb4-8">    state<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>k1 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> state<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>k1 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> result</span>
<span id="cb4-9">  } <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> {</span>
<span id="cb4-10">    state<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>n2 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> state<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>n2 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb4-11">    state<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>k2 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> state<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>k2 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> result</span>
<span id="cb4-12">  }</span>
<span id="cb4-13">  state</span>
<span id="cb4-14">}</span></code></pre></div></div>
<p>So let’s now simulate 1000 steps:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb5-1">step <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">new_stepper</span>()</span>
<span id="cb5-2">sim <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">t</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">replicate</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">unlist</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">step</span>()))))</span></code></pre></div></div>
<p>The estimated effectiveness of each resume is given by the number of successes divided by the number of applications made with that resume:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb6-1">sim<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>one_page <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> sim<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>k1 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> sim<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>n1</span>
<span id="cb6-2">sim<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>two_page <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> sim<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>k2 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> sim<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>n2</span>
<span id="cb6-3">sim<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>id <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">nrow</span>(sim)</span></code></pre></div></div>
<p>The following plot shows how that estimated probability evolves over time:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb7-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(reshape2)</span>
<span id="cb7-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(ggplot2)</span>
<span id="cb7-3">sim_long <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">melt</span>(sim, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">measure.vars =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'one_page'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'two_page'</span>))</span>
<span id="cb7-4"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(sim_long, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> id, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> value, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">col =</span> variable)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb7-5">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_line</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb7-6">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">xlab</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Applications'</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb7-7">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ylab</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Estimated probability of success'</span>)</span></code></pre></div></div>
<figure class="figure">
<p><img src="https://blog.davidlindelof.com/posts/2020/11/a-b-testing-my-resume/images/image-5-1024x731.png" class="img-fluid figure-img"></p>
<figcaption>
<p><em>Wouldn’t that be nice</em></p>
</figcaption>
</figure>
<p>As you can see, the algorithm decides pretty rapidly (after about 70 applications) that the one-page resume is more effective.</p>
<p>So here’s the protocol I’ve begun to follow since about mid-November:</p>
<ul>
<li>Apply only to jobs that I would normally have applied to</li>
<li>Go through the entire application procedure, including writing the cover letter and so on, until uploading the resume becomes unavoidable (I do this mainly to avoid any personal bias when writing cover letters)</li>
<li>Draw&nbsp;<img src="https://latex.codecogs.com/png.latex?p1">&nbsp;and&nbsp;<img src="https://latex.codecogs.com/png.latex?p2">&nbsp;as described above; select resume type with highest&nbsp;<img src="https://latex.codecogs.com/png.latex?p"></li>
<li>Adjust the resume according to the job requirements, but keep the changes to a minimum and don’t change the overall format</li>
<li>Finish the application, and record it as a failure until a call for an interview comes in</li>
</ul>
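<p>The draw-and-select step of this protocol is small enough to sketch on its own. The version below is in Python rather than R and is only an illustration: the function name <code>choose_resume</code> and the counts passed to it are made up, and it assumes each resume already has at least one recorded success and one recorded failure, so that both Beta parameters are positive.</p>

```python
import random

def choose_resume(k1, n1, k2, n2, rng=random):
    """Draw one plausible success rate per resume and pick the larger.

    k1, k2: interviews obtained so far with each resume (assumed > 0).
    n1, n2: applications sent so far with each resume (assumed > k).
    """
    # Same parametrisation as the R simulation: Beta(k, n - k).
    p1 = rng.betavariate(k1, n1 - k1)
    p2 = rng.betavariate(k2, n2 - k2)
    return "one_page" if p1 > p2 else "two_page"

# Hypothetical counts: 3 interviews out of 20 vs 1 out of 18.
rng = random.Random(2020)
picks = [choose_resume(3, 20, 1, 18, rng) for _ in range(1000)]
print(picks.count("one_page") / len(picks))
```

<p>With counts like these the one-page resume gets picked more often, but not every time, which is how the procedure keeps giving the other format a chance.</p>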
<p>I’ll be sure to report on the results in a future blog post.</p>



 ]]></description>
  <category>r</category>
  <guid>https://blog.davidlindelof.com/posts/2020/11/a-b-testing-my-resume/</guid>
  <pubDate>Tue, 24 Nov 2020 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Unit testing SQL with PySpark</title>
  <link>https://blog.davidlindelof.com/posts/2020/11/unit-testing-sql-with-pyspark/</link>
  <description><![CDATA[ 






<p>Machine-learning applications frequently feature SQL queries, which range from simple projections to complex aggregations over several join operations.</p>
<p>There doesn’t seem to be much guidance on how to verify that these queries are correct. All mainstream programming languages have embraced unit tests as the primary tool to verify the correctness of the language’s smallest building blocks—all, that is, except SQL.</p>
<p>And yet, SQL is a programming language and SQL queries are computer programs, which should be tested just like every other unit of the application.</p>
<figure class="figure">
<p><img src="https://blog.davidlindelof.com/posts/2020/11/unit-testing-sql-with-pyspark/images/image-4.png" class="img-fluid figure-img"></p>
<figcaption>
<p><em>I’m not responsible</em></p>
</figcaption>
</figure>
<p>All mainstream languages have libraries for writing unit tests: small computer programs that verify that each software module works as expected. But SQL poses a special challenge, as it can be difficult to use SQL to set up a test, execute it, and verify the output. SQL is a declarative language, usually embedded in a “host” programming language—a language in a language.</p>
<p>So to unit test SQL we need to use that host language to set up the data tables used in our queries, orchestrate the execution of the SQL queries, and verify the correctness of the results.</p>
<!--more-->
<p>One additional complication is that every relational database system defines its own SQL dialect, so that a query that runs fine on system A might not even parse on system B. Therefore, as much as technically feasible, we’ll prefer database systems that can be instantiated in memory during unit tests, but are otherwise the same as those running in production. Oracle and Teradata users, I have no idea if what follows will work for you.</p>
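<p>The set-up, execute, verify cycle is not specific to any one engine. As a minimal sketch of the idea, here it is with Python’s built-in <code>sqlite3</code> module, which I’m using purely for illustration because it needs no installation (it is not the engine used in the rest of this piece, and its dialect will differ from yours). The table and test names are made up.</p>

```python
import sqlite3

def test_can_count_rows():
    # Set up: a fresh in-memory database that disappears when closed.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
    conn.executemany("INSERT INTO people VALUES (?, ?)",
                     [("Alice", 1), ("Bob", 2)])

    # Execute: run the SQL under test from the host language.
    rows = conn.execute("SELECT COUNT(*) FROM people WHERE age > 1").fetchall()

    # Verify: check the rows that came back.
    assert rows == [(1,)]
    conn.close()

test_can_count_rows()
```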
<p>Many machine-learning applications use the <a href="https://spark.apache.org/">Apache Spark</a> engine to collect and aggregate data from (possibly huge) datasets; it has bindings to several programming languages but also offers an SQL interface. And you can easily start an instance on your local machine for testing and development. Therefore, in this piece we’ll use PySpark (a Python binding for Spark) to prepare our data in a desired state, execute SQL code against it, and verify the results.</p>
<p>But first things first. We begin by installing the required Python packages:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb1-1">  <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">$</span> pip install ipython pyspark pytest pandas numpy</span></code></pre></div></div>
<p>Before we do anything fancy, let’s make sure we understand how to run SQL code against a Spark session. We’ll write everything as PyTest unit tests, starting with a short test that will send <code>SELECT 1</code>, convert the result to a Pandas <code>DataFrame</code>, and check the results:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> pandas <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> pd</span>
<span id="cb2-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> pyspark.sql <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> SparkSession</span>
<span id="cb2-3"></span>
<span id="cb2-4"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> test_can_send_sql_to_spark():</span>
<span id="cb2-5">    spark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (SparkSession</span>
<span id="cb2-6">             .builder</span>
<span id="cb2-7">             .appName(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"utsql"</span>)</span>
<span id="cb2-8">             .getOrCreate())</span>
<span id="cb2-9">    df: pd.DataFrame <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> spark.sql(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"SELECT 1"</span>).toPandas()</span>
<span id="cb2-10"></span>
<span id="cb2-11">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">assert</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(df) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb2-12">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">assert</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(df.columns) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb2-13">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">assert</span> df.iloc[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span></code></pre></div></div>
<p>We verify that the tests pass:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb3-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">$</span> pytest <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--disable-warnings</span></span>
<span id="cb3-2"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">=============================</span> test session starts =============================</span>
<span id="cb3-3"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">platform</span> darwin <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--</span> Python 3.7.5, pytest-5.1.2, py-1.8.0, pluggy-0.13.0</span>
<span id="cb3-4"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">rootdir:</span> /Users/dlindelof/Work/app/utsql</span>
<span id="cb3-5"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">collected</span> 1 item</span>
<span id="cb3-6"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">1/test_spark_api.py</span> . <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">[</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">100%</span><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">]</span></span>
<span id="cb3-7"></span>
<span id="cb3-8"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">========================</span> 1 passed, 7 warnings in 4.82s ========================</span></code></pre></div></div>
<p>You’re right—4.82 seconds is an awfully long time for a single unit test. But most of that time is spent instantiating Spark, and will therefore be shared by all the tests we write.</p>
<p>To write more interesting queries we’ll have to populate our Spark session with data. The fundamental building block of PySpark’s API is the Spark&nbsp;<code>DataFrame</code>&nbsp;(not to be confused with Pandas’&nbsp;<code>DataFrame</code>), which you can think of as a distributed table. A Spark&nbsp;<code>DataFrame</code>&nbsp;can be created in many ways; a very convenient one is from a list of dictionaries:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;</span> d <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'name'</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Alice'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'age'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>}]</span>
<span id="cb4-2"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;</span> df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> spark.createDataFrame(d)</span>
<span id="cb4-3"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;</span> df</span>
<span id="cb4-4">DataFrame[age: bigint, name: string]</span>
<span id="cb4-5"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;</span> df.collect()</span>
<span id="cb4-6">[Row(age<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Alice'</span>)]</span>
<span id="cb4-7"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;</span></span></code></pre></div></div>
<p>As you can see, Spark does a pretty good job of inferring the data types from the dicts you provide, although that behaviour was deprecated for a while.</p>
<p>You cannot yet run SQL queries against this data frame, because no table exists that your SQL queries can refer to. To do that, use the&nbsp;<code>createOrReplaceTempView()</code>&nbsp;method:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;</span> df.createOrReplaceTempView(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'people'</span>)</span>
<span id="cb5-2"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;</span> spark.sql(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"SELECT * FROM people"</span>).toPandas()</span>
<span id="cb5-3">   age   name</span>
<span id="cb5-4"><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>    <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>  Alice</span></code></pre></div></div>
<p>That SQL query returned a data frame with just one row, with the data we provided. We didn’t need to write a table schema, as Spark inferred it for us. Before we move on, let’s capture what we have learned in a unit test.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> pandas <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> pd</span>
<span id="cb6-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> pyspark.sql <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> SparkSession</span>
<span id="cb6-3"></span>
<span id="cb6-4"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> test_can_create_sql_table():</span>
<span id="cb6-5">    spark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (SparkSession</span>
<span id="cb6-6">             .builder</span>
<span id="cb6-7">             .appName(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"utsql"</span>)</span>
<span id="cb6-8">             .getOrCreate())</span>
<span id="cb6-9"></span>
<span id="cb6-10">    d <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'name'</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Alice'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'age'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>}]</span>
<span id="cb6-11">    expected_pdf <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.DataFrame(d)</span>
<span id="cb6-12"></span>
<span id="cb6-13">    sdf <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> spark.createDataFrame(d)</span>
<span id="cb6-14">    sdf.createOrReplaceTempView(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'people'</span>)</span>
<span id="cb6-15">    actual_pdf <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> spark.sql(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"SELECT name, age FROM people"</span>).toPandas()  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># We need to be explicit about how the columns are ordered</span></span>
<span id="cb6-16"></span>
<span id="cb6-17">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">assert</span> expected_pdf.equals(actual_pdf)</span>
<span id="cb6-18"></span>
<span id="cb6-19">    spark.catalog.dropTempView(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'people'</span>)  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Delete the table after we’re done</span></span></code></pre></div></div>
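<p>The comment about column order points at a fragile spot: <code>DataFrame.equals</code> is sensitive to both column order and row order, and SQL only guarantees row order when you ask for it. One possible way around this, sketched here with plain Pandas frames and a hypothetical helper called <code>assert_same_rows</code>, is to normalise both frames before comparing them:</p>

```python
import pandas as pd

def assert_same_rows(actual: pd.DataFrame, expected: pd.DataFrame) -> None:
    """Compare two frames while ignoring column order and row order."""
    columns = sorted(expected.columns)
    # Fail early with a readable message if the column sets differ.
    assert sorted(actual.columns) == columns, f"columns differ: {list(actual.columns)}"

    def normalise(pdf):
        # Same column order, rows sorted by every column, fresh row index.
        return pdf[columns].sort_values(columns).reset_index(drop=True)

    pd.testing.assert_frame_equal(normalise(actual), normalise(expected))

expected = pd.DataFrame([{'name': 'Alice', 'age': 1}, {'name': 'Bob', 'age': 2}])
# Same rows, but columns and rows arrive in a different order.
actual = pd.DataFrame([{'age': 2, 'name': 'Bob'}, {'age': 1, 'name': 'Alice'}])

print(expected.equals(actual))      # order-sensitive comparison: False here
assert_same_rows(actual, expected)  # order-insensitive comparison passes
```

<p>Whether you want this leniency depends on the query: if it has an <code>ORDER BY</code> clause, the row order is part of what you’re testing and should not be sorted away.</p>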
<p>That’s it, really; we now know how to prepare a database with tables and rows of data; we know how to run SQL queries against it; and we know how to check assertions on the rows returned by the database. You can probably stop here and put this to use on your project, but if you’ll bear with me I’d like to walk you through a small but non-trivial example.</p>
<p>Let’s say we run a book publishing company. We keep track of authors, titles, and sales. We’d like to list all authors, together with any book they may have (co-)authored that has sold more than 1000 copies. We’re going to see whether we can craft such a query using the equivalent of Test-Driven Development for SQL. The SQL query itself will be held in a string called&nbsp;<code>QUERY</code>.</p>
<p>Let’s assume the production database consists of four tables, defined as follows:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode sql code-with-copy"><code class="sourceCode sql"><span id="cb7-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">CREATE</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">TABLE</span> authors (</span>
<span id="cb7-2">  <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">id</span> SERIAL <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">PRIMARY</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">KEY</span>,</span>
<span id="cb7-3">  name <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">varchar</span></span>
<span id="cb7-4">);</span>
<span id="cb7-5"></span>
<span id="cb7-6"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">CREATE</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">TABLE</span> books (</span>
<span id="cb7-7">  <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">id</span> SERIAL <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">PRIMARY</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">KEY</span>,</span>
<span id="cb7-8">  title <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">varchar</span></span>
<span id="cb7-9">);</span>
<span id="cb7-10"></span>
<span id="cb7-11"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">CREATE</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">TABLE</span> authorships (</span>
<span id="cb7-12">  authorid <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">INTEGER</span>,</span>
<span id="cb7-13">  bookid <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">INTEGER</span>,</span>
<span id="cb7-14">  <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">FOREIGN</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">KEY</span> (authorid) <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">REFERENCES</span> authors(<span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">id</span>),</span>
<span id="cb7-15">  <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">FOREIGN</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">KEY</span> (bookid) <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">REFERENCES</span> books(<span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">id</span>)</span>
<span id="cb7-16">);</span>
<span id="cb7-17"></span>
<span id="cb7-18"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">CREATE</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">TABLE</span> sales (</span>
<span id="cb7-19">  bookid <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">INTEGER</span>,</span>
<span id="cb7-20">  sales <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">INTEGER</span>,</span>
<span id="cb7-21">  <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">FOREIGN</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">KEY</span> (bookid) <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">REFERENCES</span> books(<span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">id</span>)</span>
<span id="cb7-22">);</span></code></pre></div></div>
<p>The&nbsp;<code>authorships</code>&nbsp;table keeps track of which author (co-)authored which book, and is necessary due to the many-to-many relationship between authors and titles.</p>
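<p>To make the many-to-many relationship concrete, here is a minimal sketch that uses Python’s built-in <code>sqlite3</code> module as a stand-in for the real database. The column definitions of <code>authors</code> and <code>books</code> and the sample rows are assumptions made for illustration only.</p>

```python
import sqlite3

# SQLite stands in for the real database; the rows below are sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE authorships (
      authorid INTEGER,
      bookid INTEGER,
      FOREIGN KEY (authorid) REFERENCES authors(id),
      FOREIGN KEY (bookid) REFERENCES books(id)
    );
    INSERT INTO authors VALUES (0, 'Larry Wall'), (1, 'Tom Christiansen');
    INSERT INTO books VALUES (0, 'Programming Perl'), (1, 'Perl Cookbook');
    -- One book with two authors, and one author with two books.
    INSERT INTO authorships VALUES (0, 0), (1, 0), (1, 1);
""")

# Resolve each authorship row into an (author name, book title) pair.
rows = conn.execute("""
    SELECT authors.name, books.title
    FROM authorships
    JOIN authors ON authorships.authorid = authors.id
    JOIN books ON authorships.bookid = books.id
    ORDER BY books.title, authors.name
""").fetchall()

for name, title in rows:
    print(f"{name} wrote {title}")
```

<p>Neither <code>authors</code> nor <code>books</code> could hold this information on its own without repeating rows, which is why the link table exists.</p>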
<p>In TDD one always starts with the simplest case. That’s frequently the degenerate case, so we’re simply going to check that we return an empty data frame when we have no published authors. We begin by setting up an empty table of authors, which lets me introduce another handy technique: setting up an empty table conforming to a given schema.</p>
<p>In most database systems you can easily create an empty table by issuing the right&nbsp;<code>CREATE TABLE</code>&nbsp;statement. But to do so in PySpark you need Hive support, which you probably don’t have on your local machine and which we won’t cover here. We&nbsp;<em>could</em>&nbsp;specify the schema manually via&nbsp;<code>StructType</code>, but see how ungainly this becomes, even for just one column:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;</span> <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> pyspark.sql.types <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> T</span>
<span id="cb8-2"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;</span> spark.createDataFrame([], schema<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>T.StructType([T.StructField(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'name'</span>, T.StringType(), <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)]))</span>
<span id="cb8-3">Out[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">15</span>]: DataFrame[name: string]</span></code></pre></div></div>
<p>Instead, we’ll do it in two steps: first, create a one-row data frame with data that&nbsp;<em>could</em>&nbsp;have come from that table (a&nbsp;<em>prototype</em>); then, create an empty data frame, but specify that its schema must be the same as the prototype’s:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;</span> template <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> spark.createDataFrame([{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'name'</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">''</span>}])</span>
<span id="cb9-2"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;</span> empty_authors <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> spark.createDataFrame([], schema<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>template.schema)</span>
<span id="cb9-3"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;</span> empty_authors</span>
<span id="cb9-4">Out[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">31</span>]: DataFrame[<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">id</span>: bigint, name: string]</span>
<span id="cb9-5"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;</span> empty_authors.show()</span>
<span id="cb9-6"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+---+----+</span></span>
<span id="cb9-7"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">id</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|</span>name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|</span></span>
<span id="cb9-8"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+---+----+</span></span>
<span id="cb9-9"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+---+----+</span></span></code></pre></div></div>
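<p>If you would rather not build a throwaway data frame, a possible variation is to derive a schema string from the prototype row in plain Python. The sketch below rests on two assumptions: <code>ddl_from_prototype</code> is a hypothetical helper, not part of PySpark, and <code>createDataFrame</code> accepts a DDL-formatted string as its <code>schema</code> argument. The type names mirror what Spark infers for plain Python values.</p>

```python
# Hypothetical helper, not part of PySpark: maps the Python types found in a
# prototype row to Spark SQL type names.
SPARK_TYPE_NAMES = {bool: "BOOLEAN", int: "BIGINT", float: "DOUBLE", str: "STRING"}


def ddl_from_prototype(prototype):
    """Build a DDL schema string from a one-row prototype dict."""
    columns = []
    for column, value in prototype.items():
        # type() is an exact match, so True maps to BOOLEAN rather than BIGINT.
        columns.append(f"{column} {SPARK_TYPE_NAMES[type(value)]}")
    return ", ".join(columns)


schema = ddl_from_prototype({"id": 0, "name": ""})
print(schema)  # id BIGINT, name STRING

# Assumption: with a live session, the string can be used as the schema:
# empty_authors = spark.createDataFrame([], schema=schema)
```

<p>The two-step approach has the advantage of letting Spark do the type inference itself, so it remains the one we use in the rest of this post.</p>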
<p>So let’s write that test:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> pyspark.sql <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> SparkSession</span>
<span id="cb10-2"></span>
<span id="cb10-3"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> test_empty_database_yields_no_authors():</span>
<span id="cb10-4">    spark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (SparkSession</span>
<span id="cb10-5">             .builder</span>
<span id="cb10-6">             .appName(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"utsql"</span>)</span>
<span id="cb10-7">             .getOrCreate())</span>
<span id="cb10-8">    template <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> spark.createDataFrame([{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'name'</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">''</span>}])</span>
<span id="cb10-9">    empty_authors <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> spark.createDataFrame([], schema<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>template.schema)</span>
<span id="cb10-10">    empty_authors.createOrReplaceTempView(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authors'</span>)</span>
<span id="cb10-11"></span>
<span id="cb10-12">    df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> spark.sql(QUERY).toPandas()</span>
<span id="cb10-13"></span>
<span id="cb10-14">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">assert</span> df.empty</span>
<span id="cb10-15"></span>
<span id="cb10-16">    spark.catalog.dropTempView(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authors'</span>)</span></code></pre></div></div>
<p>Running the test fails because we haven’t defined&nbsp;<code>QUERY</code>:</p>
<pre><code>FAILED utsql/2/test_authors.py::test_empty_database_yields_no_authors - NameError: name 'QUERY' is not defined</code></pre>
<p>So let’s populate&nbsp;<code>QUERY</code>&nbsp;with the simplest SQL code that returns an empty table:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> pyspark.sql <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> SparkSession</span>
<span id="cb12-2"></span>
<span id="cb12-3">QUERY <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"SELECT 1 WHERE 1 = 0"</span></span>
<span id="cb12-4"></span>
<span id="cb12-5"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> test_empty_database_yields_no_authors():</span>
<span id="cb12-6">    spark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (SparkSession</span>
<span id="cb12-7">             .builder</span>
<span id="cb12-8">             .appName(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"utsql"</span>)</span>
<span id="cb12-9">             .getOrCreate())</span>
<span id="cb12-10">    template <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> spark.createDataFrame([{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'name'</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">''</span>}])</span>
<span id="cb12-11">    empty_authors <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> spark.createDataFrame([], schema<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>template.schema)</span>
<span id="cb12-12">    empty_authors.createOrReplaceTempView(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authors'</span>)</span>
<span id="cb12-13"></span>
<span id="cb12-14">    df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> spark.sql(QUERY).toPandas()</span>
<span id="cb12-15"></span>
<span id="cb12-16">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">assert</span> df.empty</span>
<span id="cb12-17"></span>
<span id="cb12-18">    spark.catalog.dropTempView(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authors'</span>)</span></code></pre></div></div>
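<p>Why does this query return nothing? It reads from no table at all and its predicate is always false. As a quick sanity check outside Spark, here is the same statement run against an in-memory SQLite database (SQLite is only a stand-in here; it happens to accept the same syntax):</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# No FROM clause and a predicate that is never true: zero rows come back.
rows = conn.execute("SELECT 1 WHERE 1 = 0").fetchall()
print(rows)  # []
```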
<p>The test passes now, but it has become somewhat ungainly: it contains code to create a Spark session and to create an empty table, both of which we are going to need over and over again. Let’s turn that&nbsp;<code>SparkSession</code>&nbsp;object into a pytest fixture, a piece of the scaffolding that you can define for your tests:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb13" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb13-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> pytest</span>
<span id="cb13-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> pyspark.sql <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> SparkSession</span>
<span id="cb13-3"></span>
<span id="cb13-4">QUERY <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"SELECT 1 WHERE 1 = 0"</span></span>
<span id="cb13-5"></span>
<span id="cb13-6"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@pytest.fixture</span>(scope<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'module'</span>)</span>
<span id="cb13-7"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> spark():</span>
<span id="cb13-8">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> (SparkSession</span>
<span id="cb13-9">            .builder</span>
<span id="cb13-10">            .appName(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'utsql'</span>)</span>
<span id="cb13-11">            .getOrCreate())</span>
<span id="cb13-12"></span>
<span id="cb13-13"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> test_empty_database_yields_no_authors(spark):</span>
<span id="cb13-14">    template <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> spark.createDataFrame([{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'name'</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">''</span>}])</span>
<span id="cb13-15">    empty_authors <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> spark.createDataFrame([], schema<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>template.schema)</span>
<span id="cb13-16">    empty_authors.createOrReplaceTempView(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authors'</span>)</span>
<span id="cb13-17"></span>
<span id="cb13-18">    df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> spark.sql(QUERY).toPandas()</span>
<span id="cb13-19"></span>
<span id="cb13-20">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">assert</span> df.empty</span>
<span id="cb13-21"></span>
<span id="cb13-22">    spark.catalog.dropTempView(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authors'</span>)</span></code></pre></div></div>
<p>The test still passes. Good, now let’s factor out the code that creates an empty table:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb14" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb14-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> pytest</span>
<span id="cb14-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> pyspark.sql <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> SparkSession</span>
<span id="cb14-3"></span>
<span id="cb14-4">QUERY <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"SELECT 1 WHERE 1 = 0"</span></span>
<span id="cb14-5"></span>
<span id="cb14-6"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@pytest.fixture</span>(scope<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'module'</span>)</span>
<span id="cb14-7"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> spark():</span>
<span id="cb14-8">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> (SparkSession</span>
<span id="cb14-9">            .builder</span>
<span id="cb14-10">            .appName(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'utsql'</span>)</span>
<span id="cb14-11">            .getOrCreate())</span>
<span id="cb14-12"></span>
<span id="cb14-13"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> create_empty_table(like: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">dict</span>, name: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>, spark: SparkSession):</span>
<span id="cb14-14">    template <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> spark.createDataFrame([like])</span>
<span id="cb14-15">    empty <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> spark.createDataFrame([], schema<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>template.schema)</span>
<span id="cb14-16">    empty.createOrReplaceTempView(name)</span>
<span id="cb14-17"></span>
<span id="cb14-18"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> test_empty_database_yields_no_authors(spark):</span>
<span id="cb14-19">    create_empty_table(like<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'name'</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">''</span>}, name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authors'</span>, spark<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>spark)</span>
<span id="cb14-20"></span>
<span id="cb14-21">    df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> spark.sql(QUERY).toPandas()</span>
<span id="cb14-22"></span>
<span id="cb14-23">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">assert</span> df.empty</span>
<span id="cb14-24"></span>
<span id="cb14-25">    spark.catalog.dropTempView(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authors'</span>)</span></code></pre></div></div>
<p>Great, the test still passes. We now have a utility function for creating empty tables of arbitrary schemas.</p>
<p>So let’s now implement the first case that will&nbsp;<em>force</em>&nbsp;us to change our query: one author has written one book that has sold 1000 copies. The query should return a single row with that author and the book. First we need to populate temporary views that mirror the tables in the “real” database, and add a test with a new assertion:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb15" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb15-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> pytest</span>
<span id="cb15-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> pyspark.sql <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> SparkSession</span>
<span id="cb15-3"></span>
<span id="cb15-4">QUERY <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"SELECT 1 WHERE 1 = 0"</span></span>
<span id="cb15-5"></span>
<span id="cb15-6"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@pytest.fixture</span>(scope<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'module'</span>)</span>
<span id="cb15-7"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> spark():</span>
<span id="cb15-8">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> (SparkSession</span>
<span id="cb15-9">            .builder</span>
<span id="cb15-10">            .appName(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'utsql'</span>)</span>
<span id="cb15-11">            .getOrCreate())</span>
<span id="cb15-12"></span>
<span id="cb15-13"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> create_empty_table(like: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">dict</span>, name: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>, spark: SparkSession):</span>
<span id="cb15-14">    template <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> spark.createDataFrame([like])</span>
<span id="cb15-15">    empty <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> spark.createDataFrame([], schema<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>template.schema)</span>
<span id="cb15-16">    empty.createOrReplaceTempView(name)</span>
<span id="cb15-17"></span>
<span id="cb15-18"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> test_empty_database_yields_no_authors(spark):</span>
<span id="cb15-19">    create_empty_table(like<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'name'</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">''</span>}, name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authors'</span>, spark<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>spark)</span>
<span id="cb15-20"></span>
<span id="cb15-21">    df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> spark.sql(QUERY).toPandas()</span>
<span id="cb15-22"></span>
<span id="cb15-23">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">assert</span> df.empty</span>
<span id="cb15-24"></span>
<span id="cb15-25">    spark.catalog.dropTempView(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authors'</span>)</span>
<span id="cb15-26"></span>
<span id="cb15-27"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> test_single_book_that_sold_more_than_1000_copies_yields_single_row(spark):</span>
<span id="cb15-28">    author <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'name'</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Larry Wall'</span>}</span>
<span id="cb15-29">    spark.createDataFrame([author]).createOrReplaceTempView(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authors'</span>)</span>
<span id="cb15-30"></span>
<span id="cb15-31">    book <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'title'</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Programming Perl'</span>}</span>
<span id="cb15-32">    spark.createDataFrame([book]).createOrReplaceTempView(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'books'</span>)</span>
<span id="cb15-33"></span>
<span id="cb15-34">    authorship <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authorid'</span>: author[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>], <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'bookid'</span>: book[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>]}</span>
<span id="cb15-35">    spark.createDataFrame([authorship]).createOrReplaceTempView(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authorships'</span>)</span>
<span id="cb15-36"></span>
<span id="cb15-37">    sales <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'bookid'</span>: book[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>], <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'sales'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>}</span>
<span id="cb15-38">    spark.createDataFrame([sales]).createOrReplaceTempView(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'sales'</span>)</span>
<span id="cb15-39"></span>
<span id="cb15-40">    df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> spark.sql(QUERY).toPandas()</span>
<span id="cb15-41"></span>
<span id="cb15-42">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">assert</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(df) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb15-43"></span>
<span id="cb15-44">    [spark.catalog.dropTempView(table) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> table <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authors'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'books'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authorships'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'sales'</span>]]</span></code></pre></div></div>
<p>Running the tests now fails: the new test gets a data frame with 0 rows where we expect 1. Let’s do the simplest fix (cheat?) that will pass the new test while keeping the other one passing:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb16" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb16-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> pytest</span>
<span id="cb16-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> pyspark.sql <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> SparkSession</span>
<span id="cb16-3"></span>
<span id="cb16-4">QUERY <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb16-5"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">SELECT 1</span></span>
<span id="cb16-6"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">FROM books</span></span>
<span id="cb16-7"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">JOIN authorships</span></span>
<span id="cb16-8"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">ON books.id = authorships.bookid</span></span>
<span id="cb16-9"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">JOIN authors</span></span>
<span id="cb16-10"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">ON authorships.authorid = authors.id"""</span></span>
<span id="cb16-11"></span>
<span id="cb16-12"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@pytest.fixture</span>(scope<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'module'</span>)</span>
<span id="cb16-13"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> spark():</span>
<span id="cb16-14">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> (SparkSession</span>
<span id="cb16-15">            .builder</span>
<span id="cb16-16">            .appName(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'utsql'</span>)</span>
<span id="cb16-17">            .getOrCreate())</span>
<span id="cb16-18"></span>
<span id="cb16-19"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> create_empty_table(like: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">dict</span>, name: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>, spark: SparkSession):</span>
<span id="cb16-20">    template <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> spark.createDataFrame([like])</span>
<span id="cb16-21">    empty <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> spark.createDataFrame([], schema<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>template.schema)</span>
<span id="cb16-22">    empty.createOrReplaceTempView(name)</span>
<span id="cb16-23"></span>
<span id="cb16-24"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> test_empty_database_yields_no_authors(spark):</span>
<span id="cb16-25">    create_empty_table(like<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'name'</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">''</span>}, name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authors'</span>, spark<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>spark)</span>
<span id="cb16-26"></span>
<span id="cb16-27">    df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> spark.sql(QUERY).toPandas()</span>
<span id="cb16-28"></span>
<span id="cb16-29">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">assert</span> df.empty</span>
<span id="cb16-30"></span>
<span id="cb16-31">    spark.catalog.dropTempView(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authors'</span>)</span>
<span id="cb16-32"></span>
<span id="cb16-33"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> test_single_book_that_sold_more_than_1000_copies_yields_single_row(spark):</span>
<span id="cb16-34">    author <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'name'</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Larry Wall'</span>}</span>
<span id="cb16-35">    spark.createDataFrame([author]).createOrReplaceTempView(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authors'</span>)</span>
<span id="cb16-36"></span>
<span id="cb16-37">    book <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'title'</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Programming Perl'</span>}</span>
<span id="cb16-38">    spark.createDataFrame([book]).createOrReplaceTempView(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'books'</span>)</span>
<span id="cb16-39"></span>
<span id="cb16-40">    authorship <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authorid'</span>: author[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>], <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'bookid'</span>: book[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>]}</span>
<span id="cb16-41">    spark.createDataFrame([authorship]).createOrReplaceTempView(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authorships'</span>)</span>
<span id="cb16-42"></span>
<span id="cb16-43">    sales <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'bookid'</span>: book[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>], <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'sales'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>}</span>
<span id="cb16-44">    spark.createDataFrame([sales]).createOrReplaceTempView(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'sales'</span>)</span>
<span id="cb16-45"></span>
<span id="cb16-46">    df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> spark.sql(QUERY).toPandas()</span>
<span id="cb16-47"></span>
<span id="cb16-48">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">assert</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(df) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb16-49"></span>
<span id="cb16-50">    [spark.catalog.dropTempView(table) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> table <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authors'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'books'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authorships'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'sales'</span>]]</span></code></pre></div></div>
<p>The second test passes, but the first one fails: it only creates an empty&nbsp;<code>authors</code>&nbsp;table, so the query finds no&nbsp;<code>books</code>&nbsp;table. The obvious fix would be to populate the Spark session with empty tables right after creating it. I’ll do something slightly different: keep a&nbsp;<code>module</code>-scoped fixture that creates the Spark session, and add a per-test fixture that populates that session with empty tables, yields it, and drops the tables afterwards. That way, any test is free to replace the tables it needs, confident that the others will be present but empty:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb17" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb17-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> pytest</span>
<span id="cb17-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> pyspark.sql <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> SparkSession</span>
<span id="cb17-3"></span>
<span id="cb17-4">QUERY <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb17-5"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">SELECT 1</span></span>
<span id="cb17-6"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">FROM books</span></span>
<span id="cb17-7"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">JOIN authorships</span></span>
<span id="cb17-8"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">ON books.id = authorships.bookid</span></span>
<span id="cb17-9"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">JOIN authors</span></span>
<span id="cb17-10"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">ON authorships.authorid = authors.id"""</span></span>
<span id="cb17-11"></span>
<span id="cb17-12"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@pytest.fixture</span>(scope<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'module'</span>)</span>
<span id="cb17-13"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> spark_session():</span>
<span id="cb17-14">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> (SparkSession</span>
<span id="cb17-15">            .builder</span>
<span id="cb17-16">            .appName(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'utsql'</span>)</span>
<span id="cb17-17">            .getOrCreate())</span>
<span id="cb17-18"></span>
<span id="cb17-19"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@pytest.fixture</span>()</span>
<span id="cb17-20"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> spark(spark_session):</span>
<span id="cb17-21">    create_empty_table(like<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'name'</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">''</span>}, name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authors'</span>, spark<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>spark_session)</span>
<span id="cb17-22">    create_empty_table(like<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'title'</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">''</span>}, name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'books'</span>, spark<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>spark_session)</span>
<span id="cb17-23">    create_empty_table(like<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authorid'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'bookid'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>}, name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authorships'</span>, spark<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>spark_session)</span>
<span id="cb17-24">    create_empty_table(like<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'bookid'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'sales'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>}, name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'sales'</span>, spark<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>spark_session)</span>
<span id="cb17-25"></span>
<span id="cb17-26">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">yield</span> spark_session</span>
<span id="cb17-27"></span>
<span id="cb17-28">    [spark_session.catalog.dropTempView(table) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> table <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> (<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authors'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'books'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authorships'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'sales'</span>)]</span>
<span id="cb17-29"></span>
<span id="cb17-30"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> create_empty_table(like: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">dict</span>, name: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>, spark: SparkSession):</span>
<span id="cb17-31">    template <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> spark.createDataFrame([like])</span>
<span id="cb17-32">    empty <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> spark.createDataFrame([], schema<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>template.schema)</span>
<span id="cb17-33">    empty.createOrReplaceTempView(name)</span>
<span id="cb17-34"></span>
<span id="cb17-35"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> test_empty_database_yields_no_authors(spark):</span>
<span id="cb17-36">    df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> spark.sql(QUERY).toPandas()</span>
<span id="cb17-37"></span>
<span id="cb17-38">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">assert</span> df.empty</span>
<span id="cb17-39"></span>
<span id="cb17-40"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> test_single_book_that_sold_more_than_1000_copies_yields_single_row(spark):</span>
<span id="cb17-41">    author <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'name'</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Larry Wall'</span>}</span>
<span id="cb17-42">    spark.createDataFrame([author]).createOrReplaceTempView(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authors'</span>)</span>
<span id="cb17-43"></span>
<span id="cb17-44">    book <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'title'</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Programming Perl'</span>}</span>
<span id="cb17-45">    spark.createDataFrame([book]).createOrReplaceTempView(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'books'</span>)</span>
<span id="cb17-46"></span>
<span id="cb17-47">    authorship <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authorid'</span>: author[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>], <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'bookid'</span>: book[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>]}</span>
<span id="cb17-48">    spark.createDataFrame([authorship]).createOrReplaceTempView(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authorships'</span>)</span>
<span id="cb17-49"></span>
<span id="cb17-50">    sales <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'bookid'</span>: book[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>], <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'sales'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>}</span>
<span id="cb17-51">    spark.createDataFrame([sales]).createOrReplaceTempView(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'sales'</span>)</span>
<span id="cb17-52"></span>
<span id="cb17-53">    df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> spark.sql(QUERY).toPandas()</span>
<span id="cb17-54"></span>
<span id="cb17-55">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">assert</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(df) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span></code></pre></div></div>
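<p>The setup, yield, teardown shape of the&nbsp;<code>spark</code>&nbsp;fixture can be tried in isolation, without a Spark installation. The following is only a sketch under loud assumptions: it swaps Spark for the standard library’s&nbsp;<code>sqlite3</code>&nbsp;and pytest’s fixture for a&nbsp;<code>contextlib.contextmanager</code>; the names&nbsp;<code>SCHEMAS</code>&nbsp;and&nbsp;<code>empty_database</code>, and the column types, are made up for illustration.</p>

```python
import sqlite3
from contextlib import contextmanager

# Hypothetical schemas mirroring the four tables used in the post.
SCHEMAS = {
    'authors': 'id INTEGER, name TEXT',
    'books': 'id INTEGER, title TEXT',
    'authorships': 'authorid INTEGER, bookid INTEGER',
    'sales': 'bookid INTEGER, sales INTEGER',
}

QUERY = """
SELECT 1
FROM books
JOIN authorships
ON books.id = authorships.bookid
JOIN authors
ON authorships.authorid = authors.id"""


@contextmanager
def empty_database(connection):
    # Setup: every table exists but holds no rows.
    for table, columns in SCHEMAS.items():
        connection.execute(f'CREATE TABLE {table} ({columns})')
    try:
        yield connection
    finally:
        # Teardown: runs even if the body raises, e.g. on a failed assert.
        for table in SCHEMAS:
            connection.execute(f'DROP TABLE {table}')


# Plays the role of the shared, module-scoped session.
connection = sqlite3.connect(':memory:')

with empty_database(connection) as db:
    rows_when_empty = db.execute(QUERY).fetchall()

with empty_database(connection) as db:
    db.execute("INSERT INTO authors VALUES (0, 'Larry Wall')")
    db.execute("INSERT INTO books VALUES (0, 'Programming Perl')")
    db.execute('INSERT INTO authorships VALUES (0, 0)')
    rows_with_one_book = db.execute(QUERY).fetchall()

with empty_database(connection) as db:
    # Rows inserted by the previous block are gone.
    rows_after_cleanup = db.execute(QUERY).fetchall()
```

<p>Each&nbsp;<code>with</code>&nbsp;block starts from empty tables regardless of what the previous one inserted, which is the same guarantee the per-test fixture gives each test.</p>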
<p>All tests now pass and the code is well factored. But you’re probably horrified by that literal&nbsp;<code>1</code>&nbsp;in the&nbsp;<code>SELECT</code>&nbsp;clause. We want a list of titles and author names, so let’s amend the second test to check the column names of the result, and the query to return them:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb18" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb18-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> np</span>
<span id="cb18-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> pytest</span>
<span id="cb18-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> pyspark.sql <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> SparkSession</span>
<span id="cb18-4"></span>
<span id="cb18-5">QUERY <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb18-6"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">SELECT title, name</span></span>
<span id="cb18-7"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">FROM books</span></span>
<span id="cb18-8"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">JOIN authorships</span></span>
<span id="cb18-9"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">ON books.id = authorships.bookid</span></span>
<span id="cb18-10"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">JOIN authors</span></span>
<span id="cb18-11"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">ON authorships.authorid = authors.id"""</span></span>
<span id="cb18-12"></span>
<span id="cb18-13"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@pytest.fixture</span>(scope<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'module'</span>)</span>
<span id="cb18-14"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> spark_session():</span>
<span id="cb18-15">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> (SparkSession</span>
<span id="cb18-16">            .builder</span>
<span id="cb18-17">            .appName(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'utsql'</span>)</span>
<span id="cb18-18">            .getOrCreate())</span>
<span id="cb18-19"></span>
<span id="cb18-20"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@pytest.fixture</span>()</span>
<span id="cb18-21"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> spark(spark_session):</span>
<span id="cb18-22">    create_empty_table(like<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'name'</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">''</span>}, name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authors'</span>, spark<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>spark_session)</span>
<span id="cb18-23">    create_empty_table(like<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'title'</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">''</span>}, name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'books'</span>, spark<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>spark_session)</span>
<span id="cb18-24">    create_empty_table(like<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authorid'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'bookid'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>}, name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authorships'</span>, spark<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>spark_session)</span>
<span id="cb18-25">    create_empty_table(like<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'bookid'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'sales'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>}, name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'sales'</span>, spark<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>spark_session)</span>
<span id="cb18-26"></span>
<span id="cb18-27">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">yield</span> spark_session</span>
<span id="cb18-28"></span>
<span id="cb18-29">    [spark_session.catalog.dropTempView(table) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> table <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> (<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authors'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'books'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authorships'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'sales'</span>)]</span>
<span id="cb18-30"></span>
<span id="cb18-31"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> create_empty_table(like: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">dict</span>, name: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>, spark: SparkSession):</span>
<span id="cb18-32">    template <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> spark.createDataFrame([like])</span>
<span id="cb18-33">    empty <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> spark.createDataFrame([], schema<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>template.schema)</span>
<span id="cb18-34">    empty.createOrReplaceTempView(name)</span>
<span id="cb18-35"></span>
<span id="cb18-36"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> test_empty_database_yields_no_authors(spark):</span>
<span id="cb18-37">    df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> spark.sql(QUERY).toPandas()</span>
<span id="cb18-38"></span>
<span id="cb18-39">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">assert</span> df.empty</span>
<span id="cb18-40"></span>
<span id="cb18-41"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> test_single_book_that_sold_more_than_1000_copies_yields_single_row(spark):</span>
<span id="cb18-42">    author <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'name'</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Larry Wall'</span>}</span>
<span id="cb18-43">    spark.createDataFrame([author]).createOrReplaceTempView(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authors'</span>)</span>
<span id="cb18-44"></span>
<span id="cb18-45">    book <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'title'</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Programming Perl'</span>}</span>
<span id="cb18-46">    spark.createDataFrame([book]).createOrReplaceTempView(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'books'</span>)</span>
<span id="cb18-47"></span>
<span id="cb18-48">    authorship <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authorid'</span>: author[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>], <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'bookid'</span>: book[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>]}</span>
<span id="cb18-49">    spark.createDataFrame([authorship]).createOrReplaceTempView(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authorships'</span>)</span>
<span id="cb18-50"></span>
<span id="cb18-51">    sales <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'bookid'</span>: book[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>], <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'sales'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>}</span>
<span id="cb18-52">    spark.createDataFrame([sales]).createOrReplaceTempView(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'sales'</span>)</span>
<span id="cb18-53"></span>
<span id="cb18-54">    df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> spark.sql(QUERY).toPandas()</span>
<span id="cb18-55"></span>
<span id="cb18-56">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">assert</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(df) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb18-57">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">assert</span> np.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">all</span>(df.columns <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'title'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'name'</span>])</span></code></pre></div></div>
<p>Let’s move on to the next test, which will force us to filter out books that sold fewer than 1000 copies:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb19" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb19-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> pytest</span>
<span id="cb19-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> pyspark.sql <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> SparkSession</span>
<span id="cb19-3"></span>
<span id="cb19-4">QUERY <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb19-5"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">SELECT title, name</span></span>
<span id="cb19-6"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">FROM books</span></span>
<span id="cb19-7"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">JOIN authorships</span></span>
<span id="cb19-8"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">ON books.id = authorships.bookid</span></span>
<span id="cb19-9"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">JOIN authors</span></span>
<span id="cb19-10"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">ON authorships.authorid = authors.id"""</span></span>
<span id="cb19-11"></span>
<span id="cb19-12"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@pytest.fixture</span>(scope<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'module'</span>)</span>
<span id="cb19-13"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> spark_session():</span>
<span id="cb19-14">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> (SparkSession</span>
<span id="cb19-15">            .builder</span>
<span id="cb19-16">            .appName(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'utsql'</span>)</span>
<span id="cb19-17">            .getOrCreate())</span>
<span id="cb19-18"></span>
<span id="cb19-19"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@pytest.fixture</span>()</span>
<span id="cb19-20"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> spark(spark_session):</span>
<span id="cb19-21">    create_empty_table(like<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'name'</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">''</span>}, name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authors'</span>, spark<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>spark_session)</span>
<span id="cb19-22">    create_empty_table(like<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'title'</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">''</span>}, name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'books'</span>, spark<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>spark_session)</span>
<span id="cb19-23">    create_empty_table(like<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authorid'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'bookid'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>}, name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authorships'</span>, spark<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>spark_session)</span>
<span id="cb19-24">    create_empty_table(like<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'bookid'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'sales'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>}, name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'sales'</span>, spark<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>spark_session)</span>
<span id="cb19-25"></span>
<span id="cb19-26">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">yield</span> spark_session</span>
<span id="cb19-27"></span>
<span id="cb19-28">    [spark_session.catalog.dropTempView(table) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> table <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> (<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authors'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'books'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authorships'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'sales'</span>)]</span>
<span id="cb19-29"></span>
<span id="cb19-30"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> create_empty_table(like: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">dict</span>, name: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>, spark: SparkSession):</span>
<span id="cb19-31">    template <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> spark.createDataFrame([like])</span>
<span id="cb19-32">    empty <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> spark.createDataFrame([], schema<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>template.schema)</span>
<span id="cb19-33">    empty.createOrReplaceTempView(name)</span>
<span id="cb19-34"></span>
<span id="cb19-35"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> test_empty_database_yields_no_authors(spark):</span>
<span id="cb19-36">    df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> spark.sql(QUERY).toPandas()</span>
<span id="cb19-37"></span>
<span id="cb19-38">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">assert</span> df.empty</span>
<span id="cb19-39"></span>
<span id="cb19-40"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> test_single_book_that_sold_more_than_1000_copies_yields_single_row(spark):</span>
<span id="cb19-41">    author <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'name'</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Larry Wall'</span>}</span>
<span id="cb19-42">    spark.createDataFrame([author]).createOrReplaceTempView(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authors'</span>)</span>
<span id="cb19-43"></span>
<span id="cb19-44">    book <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'title'</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Programming Perl'</span>}</span>
<span id="cb19-45">    spark.createDataFrame([book]).createOrReplaceTempView(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'books'</span>)</span>
<span id="cb19-46"></span>
<span id="cb19-47">    authorship <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authorid'</span>: author[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>], <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'bookid'</span>: book[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>]}</span>
<span id="cb19-48">    spark.createDataFrame([authorship]).createOrReplaceTempView(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authorships'</span>)</span>
<span id="cb19-49"></span>
<span id="cb19-50">    sales <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'bookid'</span>: book[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>], <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'sales'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>}</span>
<span id="cb19-51">    spark.createDataFrame([sales]).createOrReplaceTempView(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'sales'</span>)</span>
<span id="cb19-52"></span>
<span id="cb19-53">    df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> spark.sql(QUERY).toPandas()</span>
<span id="cb19-54"></span>
<span id="cb19-55">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">assert</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(df) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb19-56"></span>
<span id="cb19-57"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> test_single_book_that_sold_less_than_1000_copies_yields_empty_table(spark):</span>
<span id="cb19-58">    author <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'name'</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Larry Wall'</span>}</span>
<span id="cb19-59">    spark.createDataFrame([author]).createOrReplaceTempView(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authors'</span>)</span>
<span id="cb19-60"></span>
<span id="cb19-61">    book <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'title'</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Programming Perl'</span>}</span>
<span id="cb19-62">    spark.createDataFrame([book]).createOrReplaceTempView(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'books'</span>)</span>
<span id="cb19-63"></span>
<span id="cb19-64">    authorship <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authorid'</span>: author[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>], <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'bookid'</span>: book[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>]}</span>
<span id="cb19-65">    spark.createDataFrame([authorship]).createOrReplaceTempView(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authorships'</span>)</span>
<span id="cb19-66"></span>
<span id="cb19-67">    sales <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'bookid'</span>: book[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>], <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'sales'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">999</span>}</span>
<span id="cb19-68">    spark.createDataFrame([sales]).createOrReplaceTempView(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'sales'</span>)</span>
<span id="cb19-69"></span>
<span id="cb19-70">    df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> spark.sql(QUERY).toPandas()</span>
<span id="cb19-71"></span>
<span id="cb19-72">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">assert</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(df) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span></code></pre></div></div>
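<p>A quick way to check what this query does with the sales figure, without starting Spark, is to replay the same rows in SQLite. The sketch below uses Python’s built-in <code>sqlite3</code> module as a stand-in for Spark SQL; <code>run_query</code> is a hypothetical helper introduced only for this illustration, and SQLite’s dialect is assumed to agree with Spark’s for a query this simple.</p>

```python
import sqlite3

# The query exactly as it stands at this point in the post.
QUERY = """
SELECT title, name
FROM books
JOIN authorships
ON books.id = authorships.bookid
JOIN authors
ON authorships.authorid = authors.id"""


def run_query(query: str, copies_sold: int) -> list:
    """Hypothetical helper: load one author, one book and one sales row
    into an in-memory SQLite database, then return the rows of `query`."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript("""
            CREATE TABLE authors (id INTEGER, name TEXT);
            CREATE TABLE books (id INTEGER, title TEXT);
            CREATE TABLE authorships (authorid INTEGER, bookid INTEGER);
            CREATE TABLE sales (bookid INTEGER, sales INTEGER);
            INSERT INTO authors VALUES (0, 'Larry Wall');
            INSERT INTO books VALUES (0, 'Programming Perl');
            INSERT INTO authorships VALUES (0, 0);
        """)
        # Only the sales figure varies between the two scenarios.
        conn.execute("INSERT INTO sales VALUES (?, ?)", (0, copies_sold))
        return conn.execute(query).fetchall()
    finally:
        conn.close()


print(run_query(QUERY, 1000))  # [('Programming Perl', 'Larry Wall')]
print(run_query(QUERY, 999))   # [('Programming Perl', 'Larry Wall')]
```

<p>Both calls return the same row, so the sales figure has no influence on the result yet.</p>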
<p>The first two tests still pass, but the new one fails because the query still returns a one-row data frame. Let’s join the <code>sales</code> table and add the missing <code>WHERE</code> clause to the query:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb20" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb20-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> pytest</span>
<span id="cb20-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> pyspark.sql <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> SparkSession</span>
<span id="cb20-3"></span>
<span id="cb20-4">QUERY <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb20-5"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">SELECT title, name</span></span>
<span id="cb20-6"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">FROM books</span></span>
<span id="cb20-7"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">JOIN authorships</span></span>
<span id="cb20-8"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">ON books.id = authorships.bookid</span></span>
<span id="cb20-9"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">JOIN authors</span></span>
<span id="cb20-10"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">ON authorships.authorid = authors.id</span></span>
<span id="cb20-11"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">JOIN sales</span></span>
<span id="cb20-12"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">ON books.id = sales.bookid</span></span>
<span id="cb20-13"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">WHERE sales.sales &gt;= 1000"""</span></span>
<span id="cb20-14"></span>
<span id="cb20-15"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@pytest.fixture</span>(scope<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'module'</span>)</span>
<span id="cb20-16"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> spark_session():</span>
<span id="cb20-17">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> (SparkSession</span>
<span id="cb20-18">            .builder</span>
<span id="cb20-19">            .appName(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'utsql'</span>)</span>
<span id="cb20-20">            .getOrCreate())</span>
<span id="cb20-21"></span>
<span id="cb20-22"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@pytest.fixture</span>()</span>
<span id="cb20-23"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> spark(spark_session):</span>
<span id="cb20-24">    create_empty_table(like<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'name'</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">''</span>}, name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authors'</span>, spark<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>spark_session)</span>
<span id="cb20-25">    create_empty_table(like<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'title'</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">''</span>}, name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'books'</span>, spark<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>spark_session)</span>
<span id="cb20-26">    create_empty_table(like<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authorid'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'bookid'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>}, name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authorships'</span>, spark<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>spark_session)</span>
<span id="cb20-27">    create_empty_table(like<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'bookid'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'sales'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>}, name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'sales'</span>, spark<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>spark_session)</span>
<span id="cb20-28"></span>
<span id="cb20-29">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">yield</span> spark_session</span>
<span id="cb20-30"></span>
<span id="cb20-31">    [spark_session.catalog.dropTempView(table) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> table <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> (<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authors'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'books'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authorships'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'sales'</span>)]</span>
<span id="cb20-32"></span>
<span id="cb20-33"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> create_empty_table(like: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">dict</span>, name: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>, spark: SparkSession):</span>
<span id="cb20-34">    template <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> spark.createDataFrame([like])</span>
<span id="cb20-35">    empty <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> spark.createDataFrame([], schema<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>template.schema)</span>
<span id="cb20-36">    empty.createOrReplaceTempView(name)</span>
<span id="cb20-37"></span>
<span id="cb20-38"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> test_empty_database_yields_no_authors(spark):</span>
<span id="cb20-39">    df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> spark.sql(QUERY).toPandas()</span>
<span id="cb20-40"></span>
<span id="cb20-41">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">assert</span> df.empty</span>
<span id="cb20-42"></span>
<span id="cb20-43"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> test_single_book_that_sold_more_than_1000_copies_yields_single_row(spark):</span>
<span id="cb20-44">    author <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'name'</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Larry Wall'</span>}</span>
<span id="cb20-45">    spark.createDataFrame([author]).createOrReplaceTempView(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authors'</span>)</span>
<span id="cb20-46"></span>
<span id="cb20-47">    book <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'title'</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Programming Perl'</span>}</span>
<span id="cb20-48">    spark.createDataFrame([book]).createOrReplaceTempView(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'books'</span>)</span>
<span id="cb20-49"></span>
<span id="cb20-50">    authorship <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authorid'</span>: author[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>], <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'bookid'</span>: book[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>]}</span>
<span id="cb20-51">    spark.createDataFrame([authorship]).createOrReplaceTempView(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authorships'</span>)</span>
<span id="cb20-52"></span>
<span id="cb20-53">    sales <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'bookid'</span>: book[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>], <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'sales'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>}</span>
<span id="cb20-54">    spark.createDataFrame([sales]).createOrReplaceTempView(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'sales'</span>)</span>
<span id="cb20-55"></span>
<span id="cb20-56">    df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> spark.sql(QUERY).toPandas()</span>
<span id="cb20-57"></span>
<span id="cb20-58">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">assert</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(df) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb20-59"></span>
<span id="cb20-60"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> test_single_book_that_sold_less_than_1000_copies_yields_empty_table(spark):</span>
<span id="cb20-61">    author <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'name'</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Larry Wall'</span>}</span>
<span id="cb20-62">    spark.createDataFrame([author]).createOrReplaceTempView(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authors'</span>)</span>
<span id="cb20-63"></span>
<span id="cb20-64">    book <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'title'</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Programming Perl'</span>}</span>
<span id="cb20-65">    spark.createDataFrame([book]).createOrReplaceTempView(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'books'</span>)</span>
<span id="cb20-66"></span>
<span id="cb20-67">    authorship <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authorid'</span>: author[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>], <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'bookid'</span>: book[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>]}</span>
<span id="cb20-68">    spark.createDataFrame([authorship]).createOrReplaceTempView(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authorships'</span>)</span>
<span id="cb20-69"></span>
<span id="cb20-70">    sales <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'bookid'</span>: book[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>], <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'sales'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">999</span>}</span>
<span id="cb20-71">    spark.createDataFrame([sales]).createOrReplaceTempView(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'sales'</span>)</span>
<span id="cb20-72"></span>
<span id="cb20-73">    df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> spark.sql(QUERY).toPandas()</span>
<span id="cb20-74"></span>
<span id="cb20-75">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">assert</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(df) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span></code></pre></div></div>
<p>And the tests pass. We’re probably done at this point, but astute readers will have noted that&nbsp;<em>Programming Perl</em>&nbsp;was actually co-written by Larry Wall and Randal L. Schwartz, so let’s verify that our query also works for multi-author works:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb21" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb21-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> pytest</span>
<span id="cb21-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> pyspark.sql <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> SparkSession</span>
<span id="cb21-3"></span>
<span id="cb21-4">QUERY <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb21-5"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">SELECT title, name</span></span>
<span id="cb21-6"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">FROM books</span></span>
<span id="cb21-7"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">JOIN authorships</span></span>
<span id="cb21-8"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">ON books.id = authorships.bookid</span></span>
<span id="cb21-9"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">JOIN authors</span></span>
<span id="cb21-10"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">ON authorships.authorid = authors.id</span></span>
<span id="cb21-11"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">JOIN sales</span></span>
<span id="cb21-12"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">ON books.id = sales.bookid</span></span>
<span id="cb21-13"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">WHERE sales &gt;= 1000"""</span></span>
<span id="cb21-14"></span>
<span id="cb21-15"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@pytest.fixture</span>(scope<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'module'</span>)</span>
<span id="cb21-16"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> spark_session():</span>
<span id="cb21-17">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> (SparkSession</span>
<span id="cb21-18">            .builder</span>
<span id="cb21-19">            .appName(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'utsql'</span>)</span>
<span id="cb21-20">            .getOrCreate())</span>
<span id="cb21-21"></span>
<span id="cb21-22"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@pytest.fixture</span>()</span>
<span id="cb21-23"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> spark(spark_session):</span>
<span id="cb21-24">    create_empty_table(like<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'name'</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">''</span>}, name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authors'</span>, spark<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>spark_session)</span>
<span id="cb21-25">    create_empty_table(like<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'title'</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">''</span>}, name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'books'</span>, spark<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>spark_session)</span>
<span id="cb21-26">    create_empty_table(like<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authorid'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'bookid'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>}, name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authorships'</span>, spark<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>spark_session)</span>
<span id="cb21-27">    create_empty_table(like<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'bookid'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'sales'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>}, name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'sales'</span>, spark<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>spark_session)</span>
<span id="cb21-28"></span>
<span id="cb21-29">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">yield</span> spark_session</span>
<span id="cb21-30"></span>
<span id="cb21-31">    [spark_session.catalog.dropTempView(table) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> table <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> (<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authors'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'books'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authorships'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'sales'</span>)]</span>
<span id="cb21-32"></span>
<span id="cb21-33"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> create_empty_table(like: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">dict</span>, name: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>, spark: SparkSession):</span>
<span id="cb21-34">    template <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> spark.createDataFrame([like])</span>
<span id="cb21-35">    empty <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> spark.createDataFrame([], schema<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>template.schema)</span>
<span id="cb21-36">    empty.createOrReplaceTempView(name)</span>
<span id="cb21-37"></span>
<span id="cb21-38"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> test_empty_database_yields_no_authors(spark):</span>
<span id="cb21-39">    df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> spark.sql(QUERY).toPandas()</span>
<span id="cb21-40"></span>
<span id="cb21-41">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">assert</span> df.empty</span>
<span id="cb21-42"></span>
<span id="cb21-43"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> test_single_book_that_sold_more_than_1000_copies_yields_single_row(spark):</span>
<span id="cb21-44">    author <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'name'</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Larry Wall'</span>}</span>
<span id="cb21-45">    spark.createDataFrame([author]).createOrReplaceTempView(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authors'</span>)</span>
<span id="cb21-46"></span>
<span id="cb21-47">    book <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'title'</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Programming Perl'</span>}</span>
<span id="cb21-48">    spark.createDataFrame([book]).createOrReplaceTempView(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'books'</span>)</span>
<span id="cb21-49"></span>
<span id="cb21-50">    authorship <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authorid'</span>: author[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>], <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'bookid'</span>: book[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>]}</span>
<span id="cb21-51">    spark.createDataFrame([authorship]).createOrReplaceTempView(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authorships'</span>)</span>
<span id="cb21-52"></span>
<span id="cb21-53">    sales <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'bookid'</span>: book[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>], <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'sales'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>}</span>
<span id="cb21-54">    spark.createDataFrame([sales]).createOrReplaceTempView(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'sales'</span>)</span>
<span id="cb21-55"></span>
<span id="cb21-56">    df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> spark.sql(QUERY).toPandas()</span>
<span id="cb21-57"></span>
<span id="cb21-58">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">assert</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(df) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb21-59"></span>
<span id="cb21-60"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> test_single_book_that_sold_less_than_1000_copies_yields_empty_table(spark):</span>
<span id="cb21-61">    author <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'name'</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Larry Wall'</span>}</span>
<span id="cb21-62">    spark.createDataFrame([author]).createOrReplaceTempView(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authors'</span>)</span>
<span id="cb21-63"></span>
<span id="cb21-64">    book <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'title'</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Programming Perl'</span>}</span>
<span id="cb21-65">    spark.createDataFrame([book]).createOrReplaceTempView(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'books'</span>)</span>
<span id="cb21-66"></span>
<span id="cb21-67">    authorship <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authorid'</span>: author[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>], <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'bookid'</span>: book[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>]}</span>
<span id="cb21-68">    spark.createDataFrame([authorship]).createOrReplaceTempView(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authorships'</span>)</span>
<span id="cb21-69"></span>
<span id="cb21-70">    sales <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'bookid'</span>: book[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>], <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'sales'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">999</span>}</span>
<span id="cb21-71">    spark.createDataFrame([sales]).createOrReplaceTempView(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'sales'</span>)</span>
<span id="cb21-72"></span>
<span id="cb21-73">    df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> spark.sql(QUERY).toPandas()</span>
<span id="cb21-74"></span>
<span id="cb21-75">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">assert</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(df) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb21-76"></span>
<span id="cb21-77"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> test_book_with_two_authors_that_sold_more_than_1000_copies_yields_two_rows(spark):</span>
<span id="cb21-78">    author1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'name'</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Larry Wall'</span>}</span>
<span id="cb21-79">    author2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'name'</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Randal L. Schwartz'</span>}</span>
<span id="cb21-80">    spark.createDataFrame([author1, author2]).createOrReplaceTempView(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authors'</span>)</span>
<span id="cb21-81"></span>
<span id="cb21-82">    book <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'title'</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Programming Perl'</span>}</span>
<span id="cb21-83">    spark.createDataFrame([book]).createOrReplaceTempView(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'books'</span>)</span>
<span id="cb21-84"></span>
<span id="cb21-85">    authorship1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authorid'</span>: author1[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>], <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'bookid'</span>: book[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>]}</span>
<span id="cb21-86">    authorship2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authorid'</span>: author2[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>], <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'bookid'</span>: book[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>]}</span>
<span id="cb21-87">    spark.createDataFrame([authorship1, authorship2]).createOrReplaceTempView(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'authorships'</span>)</span>
<span id="cb21-88"></span>
<span id="cb21-89">    sales <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'bookid'</span>: book[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'id'</span>], <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'sales'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>}</span>
<span id="cb21-90">    spark.createDataFrame([sales]).createOrReplaceTempView(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'sales'</span>)</span>
<span id="cb21-91"></span>
<span id="cb21-92">    df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> spark.sql(QUERY).toPandas()</span>
<span id="cb21-93"></span>
<span id="cb21-94">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">assert</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(df) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span></span>
<span id="cb21-95">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">assert</span> {author1[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'name'</span>], author2[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'name'</span>]} <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">set</span>(df.name)</span></code></pre></div></div>
<p>And the tests still pass. That’s probably enough testing for this simple use case, but I’m sure you can imagine far more complex scenarios. For example, one can easily use the&nbsp;<a href="https://hypothesis.readthedocs.io/en/latest/">Hypothesis</a>&nbsp;package to generate random tables, run the query, and programmatically verify that the output satisfies the desired property. But that’s a post for another day.</p>
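<p>As a rough illustration of that idea, here is a minimal sketch of such a property check. It is not the Hypothesis-based version alluded to above: to stay self-contained it swaps in the standard library’s <code>random</code> module for Hypothesis and an in-memory SQLite database for Spark, and it qualifies the filter column as <code>sales.sales</code>. The helper names (<code>random_tables</code>, <code>run_query</code>, <code>check_property</code>) and the table sizes are assumptions made for this example.</p>

```python
import random
import sqlite3

# Same joins as the query under test; the filter column is qualified
# as sales.sales to avoid any ambiguity with the table name in SQLite.
QUERY = """
SELECT title, name
FROM books
JOIN authorships ON books.id = authorships.bookid
JOIN authors ON authorships.authorid = authors.id
JOIN sales ON books.id = sales.bookid
WHERE sales.sales >= 1000
"""

def random_tables(rng):
    """Generate small random tables (sizes are arbitrary assumptions)."""
    n_authors = rng.randint(0, 5)
    n_books = rng.randint(0, 5)
    authors = [(i, f"author{i}") for i in range(n_authors)]
    books = [(i, f"book{i}") for i in range(n_books)]
    # Pick a random subset of all possible (author, book) pairs.
    pairs = [(a, b) for a in range(n_authors) for b in range(n_books)]
    authorships = rng.sample(pairs, rng.randint(0, len(pairs)))
    # Exactly one sales row per book.
    sales = [(b, rng.randint(0, 2000)) for b in range(n_books)]
    return authors, books, authorships, sales

def run_query(authors, books, authorships, sales):
    """Load the tables into an in-memory SQLite database and run QUERY."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE authors (id INTEGER, name TEXT)")
    con.execute("CREATE TABLE books (id INTEGER, title TEXT)")
    con.execute("CREATE TABLE authorships (authorid INTEGER, bookid INTEGER)")
    con.execute("CREATE TABLE sales (bookid INTEGER, sales INTEGER)")
    con.executemany("INSERT INTO authors VALUES (?, ?)", authors)
    con.executemany("INSERT INTO books VALUES (?, ?)", books)
    con.executemany("INSERT INTO authorships VALUES (?, ?)", authorships)
    con.executemany("INSERT INTO sales VALUES (?, ?)", sales)
    rows = con.execute(QUERY).fetchall()
    con.close()
    return rows

def check_property(seed):
    """Property: one row per authorship of a book with at least 1000 sales."""
    rng = random.Random(seed)
    authors, books, authorships, sales = random_tables(rng)
    rows = run_query(authors, books, authorships, sales)
    names = dict(authors)
    titles = dict(books)
    sold = {bookid for bookid, copies in sales if copies >= 1000}
    expected = sorted((titles[b], names[a]) for a, b in authorships if b in sold)
    return sorted(rows) == expected

if __name__ == "__main__":
    assert all(check_property(seed) for seed in range(100))
```

<p>Hypothesis would add shrinking of failing examples and smarter input generation on top of this; the sketch only shows the shape of the check.</p>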



 ]]></description>
  <category>python</category>
  <guid>https://blog.davidlindelof.com/posts/2020/11/unit-testing-sql-with-pyspark/</guid>
  <pubDate>Mon, 16 Nov 2020 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Scraping real estate for fun</title>
  <link>https://blog.davidlindelof.com/posts/2020/11/scraping-real-estate-for-fun-and-profit/</link>
  <description><![CDATA[ 






<p>Here’s a fun weekend project: scrape the real estate classifieds of the website of your choice, and do some analytics on the data. I did just that last weekend, using the <a href="https://scrapy.org/">Scrapy</a> Python library for web scraping, which I let loose on one of the major real estate classifieds websites in Switzerland (I can’t tell you which one; I’m not sure they would love me for it).</p>
<p>After about 10 minutes I had the data for 12’124 apartments or houses for sale across Switzerland, with room count, area, price, city, and canton.</p>
<p>I’ve imported the data into R, and log-transformed the room count, area, and price because of extreme skewness. Here’s the resulting scatterplot matrix, obtained with <code>ggpairs()</code>:</p>
<p><img src="https://blog.davidlindelof.com/posts/2020/11/scraping-real-estate-for-fun-and-profit/images/image-3-1024x629.png" class="img-fluid"></p>
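<p>The analysis above was done in R. Purely to illustrate why a log transform helps with skewed variables, here is a small Python sketch. The prices are made up for the example and do not come from the scraped dataset, and <code>skewness</code> is a helper defined here, not a library function.</p>

```python
import math
import statistics

def skewness(values):
    """Moment-based skewness: third central moment over the cubed std dev."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum((v - mean) ** 3 for v in values) / (len(values) * sd ** 3)

# Made-up prices in CHF, with one very expensive listing (an assumption
# for illustration only).
prices_chf = [350_000, 420_000, 500_000, 650_000, 800_000, 1_200_000, 9_500_000]
log_prices = [math.log10(p) for p in prices_chf]

raw_skew = skewness(prices_chf)
log_skew = skewness(log_prices)
print(f"skewness of raw prices:   {raw_skew:.2f}")
print(f"skewness of log10 prices: {log_skew:.2f}")
```

<p>On these made-up numbers the skewness drops after the transform, although a single extreme listing still leaves a visible right tail.</p>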
<p>There are a number of interesting features, even from this raw, unclean dataset:</p>
<ul>
<li>there are about twice as many apartments for sale as houses</li>
<li>the room count comes in discrete values in steps of 0.5 (half rooms are frequently used for “smaller” rooms such as a small kitchen, a small hallway, etc.)</li>
<li>the room count is highly correlated with area, as expected</li>
<li>the price is more correlated with the area than with the room count</li>
<li>there are several extreme outliers:
<ul>
<li>a property with 290 rooms (this was a typo; the owner meant an <em>area</em> of 290 m²)</li>
<li>some properties with abnormally low area (one of them was a house with a listed room count of 1 and an area of 1 m²; the seller obviously didn’t bother to enter correct data)</li>
<li>and, more interestingly, several properties with abnormally low prices; the lowest-priced item is a 3.5-room, 80 m² apartment in Fribourg priced at CHF 99.-.</li>
</ul></li>
</ul>
<p>Before we go any further, we’ll obviously have to clean up these faulty data points. There don’t seem to be many of them, so I’ll do that manually and write a follow-up post if I find anything interesting.</p>
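<p>For a larger dataset, one possible way to shortlist candidates for that manual review is a simple rule-based filter. The sketch below is only an illustration: the function <code>flag_suspicious</code>, the field names, and the thresholds are all assumptions, and apart from the values quoted in the list above the listings are invented.</p>

```python
def flag_suspicious(listing, max_rooms=20, min_area_m2=10, min_price_chf=10_000):
    """Return the reasons a listing looks faulty (empty list if none).

    The thresholds are arbitrary assumptions and would need tuning
    against the real data.
    """
    reasons = []
    if listing["rooms"] > max_rooms:
        reasons.append("room count too high")
    if listing["area_m2"] < min_area_m2:
        reasons.append("area too low")
    if listing["price_chf"] < min_price_chf:
        reasons.append("price too low")
    return reasons

# Illustrative listings; fields not mentioned in the post are invented.
listings = [
    {"rooms": 290, "area_m2": 290, "price_chf": 1_500_000},
    {"rooms": 1, "area_m2": 1, "price_chf": 450_000},
    {"rooms": 3.5, "area_m2": 80, "price_chf": 99},
    {"rooms": 4.5, "area_m2": 110, "price_chf": 890_000},
]

# Keep only the listings that trip at least one rule.
flagged = [(listing, flag_suspicious(listing))
           for listing in listings if flag_suspicious(listing)]

for listing, reasons in flagged:
    print(listing, "flagged for:", ", ".join(reasons))
```

<p>A filter like this only points at rows worth a second look; deciding whether a flagged listing is a typo or a genuine bargain still takes a human.</p>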



 ]]></description>
  <guid>https://blog.davidlindelof.com/posts/2020/11/scraping-real-estate-for-fun-and-profit/</guid>
  <pubDate>Fri, 06 Nov 2020 00:00:00 GMT</pubDate>
</item>
</channel>
</rss>
