<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.9.0">Jekyll</generator><link href="https://dglencross.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://dglencross.com/" rel="alternate" type="text/html" /><updated>2022-04-13T11:41:15+00:00</updated><id>https://dglencross.com/feed.xml</id><title type="html">Dave Glencross</title><subtitle>The home page of Dave Glencross</subtitle><author><name>Dave Glencross</name><email>dglencross@gmail.com</email><uri>https://dglencross.com</uri></author><entry><title type="html">Best Darknet Diaries episodes</title><link href="https://dglencross.com/podcasts/darknetdiaries/" rel="alternate" type="text/html" title="Best Darknet Diaries episodes" /><published>2022-04-13T00:00:00+00:00</published><updated>2022-04-13T00:00:00+00:00</updated><id>https://dglencross.com/podcasts/darknetdiaries</id><content type="html" xml:base="https://dglencross.com/podcasts/darknetdiaries/">&lt;p&gt;Having gone through almost the entire backlog of Darknet Diaries, these were my favourite episodes and the ones I recommend to people.&lt;/p&gt;

&lt;p&gt;I particularly enjoy the darknet marketplace episodes, and people hacking computer games or websites. As a rule, I don’t enjoy the episodes with penetration testers quite as much, but they are still well worth listening to. The worst episodes of Darknet Diaries are still really good. It is possibly my favourite podcast (either that or Hardcore History).&lt;/p&gt;

&lt;p&gt;I’m up to episode 111 at time of writing!&lt;/p&gt;

&lt;p&gt;These are my absolute favourites:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://darknetdiaries.com/episode/7/&quot;&gt;EP 7: Manfred (Part 1)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://darknetdiaries.com/episode/8/&quot;&gt;EP 8: Manfred (Part 2)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://darknetdiaries.com/episode/9/&quot;&gt;EP 9: The Rise and Fall of Mt. Gox&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://darknetdiaries.com/episode/24/&quot;&gt;EP 24: Operation Bayonet&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://darknetdiaries.com/episode/29/&quot;&gt;EP 29: Stuxnet&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://darknetdiaries.com/episode/99/&quot;&gt;EP 99: The Spy&lt;/a&gt; and &lt;a href=&quot;https://darknetdiaries.com/episode/100/&quot;&gt;EP 100: NSO&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://darknetdiaries.com/episode/101/&quot;&gt;EP 101: Lotería&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://darknetdiaries.com/episode/102/&quot;&gt;EP 102: Money Maker&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://darknetdiaries.com/episode/109/&quot;&gt;EP 109: TeaMp0isoN&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And these ones are also great:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://darknetdiaries.com/episode/6/&quot;&gt;EP 6: The Beirut Bank Job&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://darknetdiaries.com/episode/16/&quot;&gt;EP 16: Eijah&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://darknetdiaries.com/episode/17/&quot;&gt;EP 17: Finn&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://darknetdiaries.com/episode/20/&quot;&gt;EP 20: mobman&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://darknetdiaries.com/episode/30/&quot;&gt;EP 30: Shamoon&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://darknetdiaries.com/episode/36/&quot;&gt;EP 36: Jeremy from Marketing&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://darknetdiaries.com/episode/45/&quot;&gt;EP 45: XBox Underground (Part 1)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://darknetdiaries.com/episode/46/&quot;&gt;EP 46: XBox Underground (Part 2)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://darknetdiaries.com/episode/53/&quot;&gt;EP 53: Shadow Brokers&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://darknetdiaries.com/episode/54/&quot;&gt;EP 54: NotPetya&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://darknetdiaries.com/episode/58/&quot;&gt;EP 58: OxyMonster&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://darknetdiaries.com/episode/61/&quot;&gt;EP 61: Samy&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://darknetdiaries.com/episode/74/&quot;&gt;EP 74: Mikko&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://darknetdiaries.com/episode/78/&quot;&gt;EP 78: Nerdcore&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://darknetdiaries.com/episode/81/&quot;&gt;EP 81: The Vendor&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://darknetdiaries.com/episode/85/&quot;&gt;EP 85: Cam the Carder&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://darknetdiaries.com/episode/92/&quot;&gt;EP 92: The Pirate Bay&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://darknetdiaries.com/episode/104/&quot;&gt;EP 104: Arya&lt;/a&gt;&lt;/p&gt;</content><author><name>Dave Glencross</name><email>dglencross@gmail.com</email><uri>https://dglencross.com</uri></author><category term="podcasts" /><category term="podcasts" /><summary type="html">Having gone through almost the entire backlog of Darknet Diaries, these were my favourite episodes and the ones I recommend to people.</summary></entry><entry><title type="html">Ad-free podcasts</title><link href="https://dglencross.com/podcasts/adfree-podcasts/" rel="alternate" type="text/html" title="Ad-free podcasts" /><published>2022-04-10T00:00:00+00:00</published><updated>2022-04-10T00:00:00+00:00</updated><id>https://dglencross.com/podcasts/adfree%20podcasts</id><content type="html" xml:base="https://dglencross.com/podcasts/adfree-podcasts/">&lt;h1 id=&quot;ad-free-podcasts&quot;&gt;Ad-free podcasts&lt;/h1&gt;

&lt;p&gt;I listen to podcasts while dog-walking (generally 3 times a day), working out, doing chores, on the loo, falling asleep, and generally moving around my house.&lt;/p&gt;

&lt;p&gt;I also find adverts really annoying. Despite the ability to skip forward, I just much prefer listening to ad-free podcasts. I have given up on a lot of podcasts that would otherwise be great because the adverts are just too frequent or too annoying.&lt;/p&gt;

&lt;p&gt;Here is a selection of quality podcasts that are free of advertising. This is not an exhaustive list but they are all personal recommendations.&lt;/p&gt;

&lt;p&gt;Ad-free does not necessarily mean free though! I am happy to pay for quality, ad-free podcasts so a number of these are paid.&lt;/p&gt;

&lt;p&gt;I’ve also given a few honorary mentions for podcasts with advertising which isn’t too obtrusive.&lt;/p&gt;

&lt;p&gt;These are loosely in order of best (within their categories of paid/free).&lt;/p&gt;

&lt;h2 id=&quot;ad-free-and-zero-cost&quot;&gt;Ad-free and zero cost&lt;/h2&gt;

&lt;p&gt;Some of these might ask for you to support them, but I don’t consider that to be nearly as annoying as third-party adverts.&lt;/p&gt;

&lt;h3 id=&quot;econtalk&quot;&gt;&lt;a href=&quot;https://www.econtalk.org/&quot;&gt;EconTalk&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;Nominally an economics podcast, but actually interviews with a wide range of interesting people on any topic.&lt;/p&gt;

&lt;h3 id=&quot;80000-hours&quot;&gt;&lt;a href=&quot;https://80000hours.org/podcast/&quot;&gt;80,000 Hours&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;“In-depth conversations about the world’s most pressing problems and what you can do to solve them.”&lt;/p&gt;

&lt;p&gt;80k is the approximate number of hours the average person will work in their lives, and the idea behind this podcast is that you should spend them trying to make the biggest impact possible.&lt;/p&gt;

&lt;p&gt;These are really in-depth (and usually really long) episodes on topics generally to do with improving the present or future.&lt;/p&gt;

&lt;p&gt;Also worth listening to is their &lt;a href=&quot;https://80000hours.org/after-hours-podcast/&quot;&gt;80k After Hours podcast&lt;/a&gt;, which is on similar themes but a bit looser on structure.&lt;/p&gt;

&lt;h3 id=&quot;fall-of-civilizations&quot;&gt;&lt;a href=&quot;https://fallofcivilizationspodcast.com/&quot;&gt;Fall of Civilizations&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;“A podcast that explores the collapse of different societies through history.”&lt;/p&gt;

&lt;p&gt;Similar to Hardcore History, but on a theme of failed civilisations. Episodes are pretty rare but very long - worth going through the backlog.&lt;/p&gt;

&lt;h3 id=&quot;the-history-of-the-twentieth-century&quot;&gt;&lt;a href=&quot;https://historyofthetwentiethcentury.com/&quot;&gt;The History of the Twentieth Century&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;Unlike Fall of Civs, this podcast is comprised of many, many shorter episodes, progressing largely-chronologically through the 20th century.&lt;/p&gt;

&lt;p&gt;I can’t actually swear this podcast never has advertising, as at time of writing there are about 300 episodes and I am only on episode 61, but so far there has been none. The host asks for donations via Patreon, so hopefully the rest are ad-free too.&lt;/p&gt;

&lt;h3 id=&quot;the-wright-show&quot;&gt;&lt;a href=&quot;https://meaningoflife.tv/programs/wrightshow&quot;&gt;The Wright Show&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;Robert Wright interviews a load of different, interesting people. I particularly like his podcast with Mickey Kaus. They also do The Parrot Room together, which is listed under the paid podcasts.&lt;/p&gt;

&lt;h3 id=&quot;conversations-with-tyler&quot;&gt;&lt;a href=&quot;https://conversationswithtyler.com/&quot;&gt;Conversations With Tyler&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;Quite similar to EconTalk but with Tyler Cowen, who has a very distinct interview style which keeps the conversation moving quickly. Tyler is an economist, but like EconTalk the topics are quite broad.&lt;/p&gt;

&lt;h3 id=&quot;the-entrepreneur-first-podcast&quot;&gt;&lt;a href=&quot;https://www.joinef.com/stories/podcasts/&quot;&gt;The Entrepreneur First Podcast&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;EF is an organisation that brings aspiring company founders together and invests in them. Their podcast is various interviews with people who have gone through the process.&lt;/p&gt;

&lt;p&gt;One of my friends has gone through EF, which is how I discovered the podcast.&lt;/p&gt;

&lt;p&gt;If you like it, you can also try The Founder’s Mindset podcast which is on the same link.&lt;/p&gt;

&lt;h3 id=&quot;the-bugle&quot;&gt;&lt;a href=&quot;https://www.thebuglepodcast.com/&quot;&gt;The Bugle&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;I am generally not that into comedy podcasts, but The Bugle has long been a favourite even after John Oliver left.&lt;/p&gt;

&lt;h3 id=&quot;the-cryptid-factor&quot;&gt;&lt;a href=&quot;https://www.thecryptidfactor.com/&quot;&gt;The Cryptid Factor&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;Despite just saying I don’t like comedy podcasts that much, this is a great one. Rhys Darby et al talk about Cryptozoology. Rarely released but a good backlog to work through.&lt;/p&gt;

&lt;h3 id=&quot;pushback-with-aaron-maté&quot;&gt;&lt;a href=&quot;https://thegrayzone.com/pushback/&quot;&gt;Pushback with Aaron Maté&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;Gives an alternative view to a lot of coverage of American and worldwide issues, for example through interviews with Russian representatives before and during the war with Ukraine. It is interesting to hear viewpoints that I don’t hear elsewhere, though they are possibly controversial. Aaron had an interview with a Russian representative to the UN just before the invasion of Ukraine, who said a lot of things which turned out to be completely false once the war started. Still, it is entertaining to hear different viewpoints.&lt;/p&gt;

&lt;h3 id=&quot;the-naked-pravda&quot;&gt;&lt;a href=&quot;https://meduza.io/en/podcasts/the-naked-pravda&quot;&gt;The Naked Pravda&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;Counterpoint to the one above - a Russian-made and Russia-focused podcast, but banned from Russia. Definitely anti-Putin.&lt;/p&gt;

&lt;h3 id=&quot;naturally-speaking&quot;&gt;&lt;a href=&quot;https://naturallyspeaking.blog/category/podcasts/&quot;&gt;Naturally Speaking&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;Possibly the most niche one here: scientific podcasts from the Institute of Biodiversity, Animal Health and Comparative Medicine at the University of Glasgow.&lt;/p&gt;

&lt;h3 id=&quot;more-or-less-behind-the-statistics&quot;&gt;&lt;a href=&quot;https://www.bbc.co.uk/programmes/p02nrss1/episodes/downloads&quot;&gt;More or Less: Behind the Statistics&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;A BBC show which investigates any interesting numbers in the news, to see where they come from and how accurate they are. Ad-free if you are based in the UK or VPN in to it.&lt;/p&gt;

&lt;h3 id=&quot;in-our-time&quot;&gt;&lt;a href=&quot;https://www.bbc.co.uk/programmes/b006qykl&quot;&gt;In Our Time&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;A podcast on ideas, people and events. It is an excellent podcast to fall asleep to. The BBC used to produce a lot of good ad-free podcasts (if you are based in the UK or VPN in to it) but unfortunately they’ve moved a bunch to their BBC Sounds app. I am not interested in that.&lt;/p&gt;

&lt;h3 id=&quot;hear-this-idea&quot;&gt;&lt;a href=&quot;https://hearthisidea.com/&quot;&gt;Hear This Idea&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;Interviews about philosophy, the social sciences and effective altruism.&lt;/p&gt;

&lt;h3 id=&quot;astral-codex-ten-podcast&quot;&gt;&lt;a href=&quot;https://podcasts.apple.com/gb/podcast/astral-codex-ten-podcast/id1295289140&quot;&gt;Astral Codex Ten Podcast&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;A guy literally just reads Astral Codex Ten articles aloud. This can be very weird when there are a lot of diagrams that he dictates - if you’re into ACT though it’s a different way to consume the articles.&lt;/p&gt;

&lt;p&gt;He has a Patreon but I’m not into it nearly enough to donate.&lt;/p&gt;

&lt;h2 id=&quot;ad-free-and-paid&quot;&gt;Ad-free and paid&lt;/h2&gt;

&lt;p&gt;These podcasts are generally a few pounds/dollars a month. Some of these have free versions (but ad-supported).&lt;/p&gt;

&lt;h3 id=&quot;darknet-diaries&quot;&gt;&lt;a href=&quot;https://darknetdiaries.com/&quot;&gt;Darknet Diaries&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;A podcast about hacking and computer security. This is a great, great podcast - the best episodes are fantastic, and even the worst episodes are better than most other podcasts. I intend to do a post just on the best episodes of Darknet Diaries.&lt;/p&gt;

&lt;p&gt;My absolute favourite podcast, and sadly I am very close to getting through the backlog.&lt;/p&gt;

&lt;p&gt;There is an ad-supported version so you can try it out for free first.&lt;/p&gt;

&lt;h3 id=&quot;the-parrot-room&quot;&gt;&lt;a href=&quot;https://www.patreon.com/parrotroom&quot;&gt;The Parrot Room&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;Every week Robert Wright and Mickey Kaus do an episode of The Wright Show for free, then do a longer subsequent episode for The Parrot Room. It’s essentially a continuation of the same conversation. If you like their Wright Show episode then it’s good value for money to keep listening.&lt;/p&gt;

&lt;h3 id=&quot;this-week-in-tech&quot;&gt;&lt;a href=&quot;https://twit.tv/&quot;&gt;This Week In Tech&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;TWiT is a podcast network of varying tech-related shows. You can get most of them ad-supported, or you can join the ‘club’ and get ad-free versions of them all, plus some bonus shows. I particularly like the following shows:&lt;/p&gt;

&lt;p&gt;This Week in Tech&lt;/p&gt;

&lt;p&gt;This Week in Google&lt;/p&gt;

&lt;p&gt;Also worth a listen:&lt;/p&gt;

&lt;p&gt;Security Now&lt;/p&gt;

&lt;p&gt;Tech News Weekly&lt;/p&gt;

&lt;h3 id=&quot;oh-god-what-now&quot;&gt;&lt;a href=&quot;https://www.patreon.com/ohgodwhatnow&quot;&gt;Oh God, What Now?&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;This is a UK-focused political podcast previously known as Remainiacs. It looks at the political news from the perspective of anti-Brexit, pro-European people.&lt;/p&gt;

&lt;h3 id=&quot;the-bunker&quot;&gt;&lt;a href=&quot;https://www.patreon.com/bunkercast&quot;&gt;The Bunker&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;Another UK-focused podcast, from the same people as Oh God, What Now?, but less to do with Britain’s relationship to the world and more on general politics and culture.&lt;/p&gt;

&lt;h3 id=&quot;making-sense&quot;&gt;&lt;a href=&quot;https://www.samharris.org/podcasts&quot;&gt;Making Sense&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;Sam Harris’ podcast, often controversial but generally entertaining podcasts about current events. He is also very into meditation and the science of the brain but I don’t really get on with those episodes.&lt;/p&gt;

&lt;h2 id=&quot;honourable-mentions---ads-not-too-intrusive&quot;&gt;Honourable Mentions - ads not too intrusive&lt;/h2&gt;

&lt;p&gt;There are also some excellent podcasts which don’t have much advertising, or it is very unobtrusive.&lt;/p&gt;

&lt;h3 id=&quot;hardcore-history&quot;&gt;&lt;a href=&quot;https://www.dancarlin.com/hardcore-history-series/&quot;&gt;Hardcore History&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;Dan Carlin’s Hardcore History is an amazing podcast. Episodes consist of Dan Carlin telling a story for hours about a particular historical topic, and topics frequently run to multiple episodes.&lt;/p&gt;

&lt;p&gt;Episodes are very rarely released but usually many hours long, and if you haven’t listened then there’s a huge backlog to get through (though most of the old ones are paid).&lt;/p&gt;

&lt;p&gt;This almost made it to the ad-free list but he does have the occasional advert at the end of an episode. About as unobtrusive as can be. Some episodes have no adverts.&lt;/p&gt;

&lt;p&gt;There’s also the &lt;a href=&quot;https://www.dancarlin.com/addendum/&quot;&gt;Hardcore History Addendum&lt;/a&gt; for extra bits and interviews.&lt;/p&gt;

&lt;h3 id=&quot;common-sense-with-dan-carlin&quot;&gt;&lt;a href=&quot;https://www.dancarlin.com/product-category/common-sense-with-dan-carlin/&quot;&gt;Common Sense with Dan Carlin&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;Another podcast from Dan Carlin, and another one with very rare episodes. USA-focused opinions from a libertarian perspective.&lt;/p&gt;

&lt;p&gt;Same situation as Hardcore History with adverts, mostly at the end and not at all annoying.&lt;/p&gt;

&lt;h3 id=&quot;clearer-thinking&quot;&gt;&lt;a href=&quot;https://www.clearerthinking.org/podcast&quot;&gt;Clearer Thinking&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;One of my favourite podcasts despite the adverts, Spencer Greenberg has really thoughtful ‘rationalist’ conversations about practical concepts and frameworks that can be applied to your own life. “Ideas that truly matter” according to the blurb.&lt;/p&gt;

&lt;p&gt;The adverts are pretty inoffensive and often just advertise things on the Clearer Thinking website. Third-party adverts are at least for relevant products, rather than just having random auto-inserted ads.&lt;/p&gt;

&lt;h3 id=&quot;rationally-speaking&quot;&gt;&lt;a href=&quot;http://rationallyspeakingpodcast.org/past-episodes/&quot;&gt;Rationally Speaking&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;Quite similar to Clearer Thinking, Julia Galef interviews people about “the borderlands between reason and nonsense”. Good podcast, adverts aren’t too annoying, though there haven’t been any episodes for a while.&lt;/p&gt;

&lt;h3 id=&quot;lex-fridman&quot;&gt;&lt;a href=&quot;https://lexfridman.com/podcast/&quot;&gt;Lex Fridman&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;Lex interviews people on a wide variety of topics. Extremely in-depth conversations. I only listen to the ones that sound interesting from the titles, as they are very long and I find Lex to be a little irritating - but the great guests often make up for it, and he does ask good questions.&lt;/p&gt;

&lt;p&gt;In terms of ads, they take up about 8-10 minutes at the beginning of each episode, but Lex puts in bookmark timestamps so you can easily click to jump to the start of the actual conversation. After that, there are no more adverts.&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;That’s a pretty long list! Most of my podcast listening involves zero advertising, with some exceptions for particularly good podcasts.&lt;/p&gt;

&lt;p&gt;I’m always looking for more to listen to so I’ll update this list when I do.&lt;/p&gt;</content><author><name>Dave Glencross</name><email>dglencross@gmail.com</email><uri>https://dglencross.com</uri></author><category term="podcasts" /><category term="podcasts" /><summary type="html">Ad-free podcasts</summary></entry><entry><title type="html">Theodore Roosevelt vs Speed Reading</title><link href="https://dglencross.com/books/roosevelt-vs-speed-reading/" rel="alternate" type="text/html" title="Theodore Roosevelt vs Speed Reading" /><published>2020-01-04T00:00:00+00:00</published><updated>2020-01-04T00:00:00+00:00</updated><id>https://dglencross.com/books/roosevelt-vs-speed-reading</id><content type="html" xml:base="https://dglencross.com/books/roosevelt-vs-speed-reading/">&lt;h2 id=&quot;theodore-roosevelt-vs-speed-reading&quot;&gt;Theodore Roosevelt vs Speed Reading&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;The President manages to get through at least one book a day even when he is busy. Owen Wister has lent him a book shortly before a full evening’s entertainment at the White House, and been astonished to hear a complete review of it over breakfast. “Somewhere between six one evening and eight-thirty next morning, beside his dressing and his dinner and his guests and his sleep, he had read a volume of three-hundred-and-odd pages, and missed nothing of significance it contained.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;– from The Rise of Theodore Roosevelt by Edmund Morris.&lt;/p&gt;

&lt;p&gt;This biography of Teddy Roosevelt was fascinating, especially considering I had no real interest in the topic before deciding to read it.&lt;/p&gt;

&lt;p&gt;This may raise the question of why I read a huge book about him in the first place. The answer is that I played a lot of Red Dead Redemption 2, which caused me to get interested in the topic of cowboys, which led me to read “Cattle Kingdom - The Hidden History of the Cowboy West” by Christopher Knowlton (another excellent book, if that period appeals to you anyway), which features Roosevelt as a major ranch owner, which inspired me to find a biography of the guy.&lt;/p&gt;

&lt;p&gt;He’s a fascinating character and inspiring in the sheer amount of work he gets done. It is just incomprehensible to me, a man who didn’t write a blog post for ten months due to laziness.&lt;/p&gt;

&lt;p&gt;One thing that really stood out to me was his apparent ability to read at a prodigious rate. I read a lot of non-fiction (and some fiction) and would really like to increase the rate at which I read books. There are a lot of highly-rated books and not enough time to get through them.&lt;/p&gt;

&lt;p&gt;I wouldn’t say I am a particularly fast or slow reader but I had heard of speed reading and decided to look into it.&lt;/p&gt;

&lt;p&gt;However, as this blog post from &lt;a href=&quot;https://www.scotthyoung.com/blog/2015/01/19/speed-reading-redo/&quot;&gt;Scott Young&lt;/a&gt; describes, it seems like the whole idea of it is just impossible. Supposedly just going above 500 words a minute is improbable because of how the eye works, and getting up to that speed would sacrifice comprehension.&lt;/p&gt;

&lt;p&gt;It seems like speed reading is just another fad idea which is not worth trying (file that along with barefoot running).&lt;/p&gt;

&lt;p&gt;But the question I come back to is - how did Roosevelt manage to get through books so quickly and retain information? Maybe it’s an apocryphal tale, or maybe he was a genius - which does seem apparent from his biography.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;On evenings like this, when he has no official entertaining to do, Roosevelt will read two or three books entire.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;How??&lt;/p&gt;

&lt;p&gt;N.B.
A couple of great quotes:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Wine makes me awfully fighty&lt;/em&gt; - from Roosevelt’s diary&lt;/p&gt;

&lt;p&gt;&lt;em&gt;While its first edict, promising to “hang, burn or drown any man that will ask for public improvements at the expense of the County” could have been worded more diplomatically, it at least voiced sound Republican sentiments&lt;/em&gt; - about Wild West politics&lt;/p&gt;</content><author><name>Dave Glencross</name><email>dglencross@gmail.com</email><uri>https://dglencross.com</uri></author><category term="books" /><category term="books" /><summary type="html">Theodore Roosevelt vs Speed Reading</summary></entry><entry><title type="html">Thames Trot Ultra Marathon Race Review</title><link href="https://dglencross.com/running/ultra/thames-trot/" rel="alternate" type="text/html" title="Thames Trot Ultra Marathon Race Review" /><published>2019-11-30T00:00:00+00:00</published><updated>2019-11-30T00:00:00+00:00</updated><id>https://dglencross.com/running/ultra/thames-trot</id><content type="html" xml:base="https://dglencross.com/running/ultra/thames-trot/">&lt;h3 id=&quot;483-miles--8h32------37th-out-of-145-finishers-195-starters&quot;&gt;48.3 miles || 8h32  ||    37th out of 145 finishers (195 starters)&lt;/h3&gt;

&lt;p&gt;October 26th was the date of my first ultra-marathon, the Thames Trot. Officially it is called the ‘Thames Trot 50’ but in reality, it was 48 miles. It might seem like a marginal difference… but it does mean I can’t claim a 50-mile PB.&lt;/p&gt;

&lt;p&gt;The event starts in Oxford and almost entirely follows the Thames path towards the finish in Henley-On-Thames.&lt;/p&gt;

&lt;p&gt;My TLDR of my experience would be ‘the worst day of my life’. It was much harder than I expected and it rained all day long.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://www.dglencross.com/assets/images/thamestrot2019/route.png&quot; alt=&quot;The route&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The race starts in a hotel car park but quickly gets on the Thames path, which it turns out is largely just mud. It had been raining in the run-up to the event and the path had taken the hit.&lt;/p&gt;

&lt;p&gt;The race itself was well-organised. Although there was no on-course signage, everyone navigated by GPS watch and the aid stations were well-supplied. Plus all the marshals were friendly and encouraging!&lt;/p&gt;

&lt;p&gt;My longest training run was 24 miles, and although pretty tiring I felt OK afterwards. 24 miles is pretty short for the longest training run when attempting a 50ish mile race, but my main target was (and still is) the Country to Capital race in January 2020.&lt;/p&gt;

&lt;p&gt;However, 24 miles on pavement turned out to be very different compared to the same distance on mud.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://www.dglencross.com/assets/images/thamestrot2019/mud.jpg&quot; alt=&quot;Mud&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Going into the race, I was suffering a very minor calf strain and trying to rest it. I was concerned it would flare up and cause me to drop out. I have a terrible track record for pulling out of races injured.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;mile 1 - feeling good and starting much too fast.&lt;/li&gt;
  &lt;li&gt;first 10 miles - I could feel my calf but was ticking off the miles pretty quickly.&lt;/li&gt;
  &lt;li&gt;20 miles - I was already in way more pain than I was expecting.&lt;/li&gt;
  &lt;li&gt;24 miles - suffering. It was around the half-way point that I started to take walk breaks. On the bright side, my whole lower half hurt so much that my calf was no longer a distinct pain.&lt;/li&gt;
  &lt;li&gt;miles 24-40 - this is where I fell to pieces. I was walking a huge amount, being overtaken by people I had passed earlier.&lt;/li&gt;
  &lt;li&gt;miles 40-47 - I regained some composure and started a pattern of running/jogging a quarter of a mile, then walking for a short distance, and repeating.&lt;/li&gt;
  &lt;li&gt;mile 48 - I was so close to the end that I managed to run it in without any more walking breaks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I was in pain so early that the second half became a real struggle, both mentally and physically. The mud just saps the leg strength so much more than running on pavement. Constantly recalculating how long you’re going to be out running in the rain was demoralising. I’m doing 11 minutes per mile - it’s going to be 4 hours. Now I’m going at 12 minutes per mile - 4 and a half!&lt;/p&gt;

&lt;p&gt;My big mistake was running the first 20 miles too fast. In ultras, it is a very common tactic to walk the uphills as it takes too much out of your legs to run them. The Thames Trot is completely flat for the first 30 miles, so I ran without walking breaks until I could no longer keep it up, and by that time it was too late - I was wrecked.&lt;/p&gt;

&lt;p&gt;With about 6 miles to go, I caught up to a guy who had overtaken me earlier in the day. He had just fallen over and was walking. I told him about my quarter-mile run strategy and he joined me with it. It turned out he was having an identical experience to me - first ultra, came in confident after a good long run, suffering much earlier than he expected. It didn’t stop the pain but it helped to pass the time.&lt;/p&gt;

&lt;p&gt;The heaviest rain of the day fell during the last 6 miles. Crossing the finish line brought no real sense of satisfaction, but I was glad it was over. This was the worst experience of my life - the most difficult thing I have ever done.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://www.dglencross.com/assets/images/thamestrot2019/finish.jpg&quot; alt=&quot;The finishing straight&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Although it was a horrendous experience, I’ll be having another go in January 2020 - the Country to Capital 45 (which seems to actually be 43 miles). I’m hoping to start slower and be able to enjoy the experience a bit more!&lt;/p&gt;</content><author><name>Dave Glencross</name><email>dglencross@gmail.com</email><uri>https://dglencross.com</uri></author><category term="running" /><category term="ultra" /><category term="running" /><category term="ultra" /><summary type="html">48.3 miles || 8h32 || 37th out of 145 finishers (195 starters)</summary></entry><entry><title type="html">Machine learning in Python with Scikit-learn - a crash course</title><link href="https://dglencross.com/machine%20learning/machine-learning/" rel="alternate" type="text/html" title="Machine learning in Python with Scikit-learn - a crash course" /><published>2019-02-09T00:00:00+00:00</published><updated>2019-02-09T00:00:00+00:00</updated><id>https://dglencross.com/machine%20learning/machine-learning</id><content type="html" xml:base="https://dglencross.com/machine%20learning/machine-learning/">&lt;h2 id=&quot;or-how-to-fake-your-way-through-machine-learning&quot;&gt;Or: how to fake your way through machine learning&lt;/h2&gt;

&lt;p&gt;At my place of work, we recently had a hackathon (during work hours!) in which we could spend 2 days trying to create something that would benefit the company. My idea was to use machine learning for this:&lt;/p&gt;

&lt;h2 id=&quot;predicting-loan-defaults&quot;&gt;Predicting loan defaults&lt;/h2&gt;

&lt;p&gt;I would use existing customer data, with knowledge of who did and did not default (i.e. not pay us back), to then predict whether a currently-applying customer would default or not.&lt;/p&gt;

&lt;p&gt;My knowledge of machine learning was almost entirely from doing Andrew Ng’s machine learning course on Coursera, but it had been about a year since I’d done it and I had never implemented anything afterwards.&lt;/p&gt;

&lt;p&gt;So my 2-day hackathon was a bit of a crash course in machine learning. Here’s how I did it:&lt;/p&gt;

&lt;h3 id=&quot;the-basics&quot;&gt;The basics&lt;/h3&gt;

&lt;p&gt;This is &lt;em&gt;labelled training&lt;/em&gt;, also known as supervised learning. I had a large amount of data (which I obviously cannot share!) with each row containing information that we knew about a customer at their point of application, and whether or not they had defaulted.&lt;/p&gt;

&lt;p&gt;This was a &lt;em&gt;classification&lt;/em&gt; exercise. That means I was just trying to classify each new customer as predicted to default or not default.&lt;/p&gt;

&lt;h3 id=&quot;the-plan&quot;&gt;The plan&lt;/h3&gt;

&lt;p&gt;I used &lt;a href=&quot;https://scikit-learn.org/stable/index.html&quot;&gt;Scikit-Learn&lt;/a&gt;, which seems to be a widely-popular library for doing machine learning in Python. It implements a wide range of machine learning algorithms so I would not have to do that work myself. This would be an exercise in plugging things together (I hoped!).&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://imgs.xkcd.com/comics/machine_learning.png&quot; alt=&quot;Not far from the truth&quot; /&gt;&lt;/p&gt;

&lt;h3 id=&quot;first-steps---manipulating-data&quot;&gt;First steps - manipulating data&lt;/h3&gt;

&lt;p&gt;Most Scikit-Learn estimators require all inputs to be numerical. No text, no dates, not even any null/empty fields.&lt;/p&gt;

&lt;p&gt;My data was made up mostly of publicly available information (e.g. Companies House data) plus information the customers themselves provide. This included plenty of blank fields - for example, newer companies would have less information on Companies House.&lt;/p&gt;

&lt;p&gt;Step one then - load in the data.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;pandas&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pd&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;training_path&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;sa&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'application_info.csv'&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Read in data
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read_csv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;training_path&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h4 id=&quot;textstring-fields&quot;&gt;Text/string fields&lt;/h4&gt;

&lt;p&gt;An example text field: industry category&lt;/p&gt;

&lt;p&gt;Scikit-Learn won’t take string inputs. But you can’t just convert each string to a number: Scikit-Learn would treat that column as an ordered numeric scale, which makes no sense here. If you label ‘Transport’ as 1, ‘Television’ as 2, ‘Catering’ as 3 and so on, you’re saying Television is halfway between Transport and Catering - which is meaningless.&lt;/p&gt;

&lt;p&gt;Instead, you need to have each industry category be its own column, and set the values to be 1 (for yes, this row is in this category) or 0 (for not). This is usually called one-hot encoding.&lt;/p&gt;

&lt;p&gt;Scikit-Learn can do this for you, using a LabelBinarizer (which is designed for target labels; OneHotEncoder is the equivalent aimed at feature columns):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;sklearn.preprocessing&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;LabelBinarizer&lt;/span&gt;

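&lt;span class=&quot;c1&quot;&gt;# `label` is the name of the text column to encode
&lt;/span&gt;&lt;span class=&quot;c1&quot;&gt;# `features` is the list of column names used as model inputs
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lb&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;LabelBinarizer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;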
    
&lt;span class=&quot;n&quot;&gt;X&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lb&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fit_transform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;label&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;values&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    
&lt;span class=&quot;c1&quot;&gt;# add new generated columns to data
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;newColumn&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;DataFrame&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;X&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;columns&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;label&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;X&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shape&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])])&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;concat&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;newColumn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;axis&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Delete old label, add new ones
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;features&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;remove&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;label&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;X&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shape&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]):&lt;/span&gt;
   &lt;span class=&quot;n&quot;&gt;features&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;append&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;label&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
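
&lt;p&gt;As a rough alternative sketch, pandas can do the same one-column-per-category expansion with &lt;code&gt;get_dummies&lt;/code&gt;. The column names &lt;code&gt;industry&lt;/code&gt; and &lt;code&gt;turnover&lt;/code&gt; and the toy rows below are made up for illustration; they are not from my data.&lt;/p&gt;

```python
import pandas as pd

# Hypothetical toy data: 'industry' stands in for a real text column
toy = pd.DataFrame({
    "industry": ["Transport", "Television", "Catering", "Transport"],
    "turnover": [100, 250, 80, 120],
})

# One 0/1 column per category; the original text column is dropped
encoded = pd.get_dummies(toy, columns=["industry"], dtype=int)

print(encoded.columns.tolist())
```

&lt;p&gt;Each new column is named after the category it represents, which can make the result easier to read than numbered columns.&lt;/p&gt;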

&lt;h4 id=&quot;getting-rid-of-nulls&quot;&gt;Getting rid of nulls&lt;/h4&gt;

&lt;p&gt;There are choices to be made here. These are your potential strategies:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;delete rows with nulls. This was impractical for me - I would lose too much data&lt;/li&gt;
  &lt;li&gt;Replace nulls with something:
    &lt;ul&gt;
      &lt;li&gt;0&lt;/li&gt;
      &lt;li&gt;The mean of the rest of the values in that column&lt;/li&gt;
      &lt;li&gt;The median&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Which makes the most sense? It depends on the data. For example, a customer’s total other loans - what does null mean in this instance? Should we assume they don’t have other loans, or just that we’re lacking that information? It’s a choice you have to make.&lt;/p&gt;

&lt;p&gt;Whatever you do, you will affect your overall result. It’s best to play around with these (which I’ll discuss later).&lt;/p&gt;
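
&lt;p&gt;A minimal sketch of the three replacement strategies, using Scikit-Learn’s &lt;code&gt;SimpleImputer&lt;/code&gt; on a made-up column (the numbers are invented, standing in for something like total other loans):&lt;/p&gt;

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical column with two missing values
loans = np.array([[1000.0], [np.nan], [3000.0], [np.nan], [8000.0]])

strategies = {
    "zero": SimpleImputer(strategy="constant", fill_value=0),
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
}

# Fill the gaps with each strategy and compare the results
filled = {name: imp.fit_transform(loans).ravel() for name, imp in strategies.items()}
for name, values in filled.items():
    print(name, values)
```

&lt;p&gt;Here the gaps become 0, 4000 or 3000 depending on the strategy, which shows how much the choice can shift the data.&lt;/p&gt;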

&lt;h4 id=&quot;dates&quot;&gt;Dates&lt;/h4&gt;

&lt;p&gt;You can do whatever you want here - it might just be easiest to delete dates if they are irrelevant.&lt;/p&gt;

&lt;p&gt;I just ignored dates. But if you want to use them, you could extract numerical data from them, e.g. day of the week 1-7, time of day, year etc.&lt;/p&gt;
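
&lt;p&gt;If you did want to keep them, one possible sketch with pandas looks like this. The &lt;code&gt;applied_at&lt;/code&gt; column and its values are hypothetical:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical application timestamps
dates = pd.DataFrame({"applied_at": ["2019-01-07 09:30", "2019-02-09 17:45"]})
dates["applied_at"] = pd.to_datetime(dates["applied_at"])

# Pull out numeric parts that a model can use
dates["year"] = dates["applied_at"].dt.year
dates["month"] = dates["applied_at"].dt.month
dates["day_of_week"] = dates["applied_at"].dt.dayofweek + 1  # Monday=1 ... Sunday=7
dates["hour"] = dates["applied_at"].dt.hour

# Drop the raw timestamp so only numeric columns remain
numeric_dates = dates.drop(columns=["applied_at"])
print(numeric_dates)
```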

&lt;h3 id=&quot;starting-to-process-data&quot;&gt;Starting to process data&lt;/h3&gt;

&lt;p&gt;What algorithm are you going to use? What parameters? How to start?&lt;/p&gt;

&lt;p&gt;Scikit-Learn has a class called GridSearchCV, which lets you try a load of things at once.&lt;/p&gt;

&lt;p&gt;There is a great example of how to use this on my friend &lt;a href=&quot;https://github.com/dgmp88/dgmp88.github.io/blob/master/notebook/Scikit-Learn%20Patterns.ipynb&quot;&gt;George’s GitHub page&lt;/a&gt;.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;sklearn.pipeline&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Pipeline&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;sklearn.linear_model&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;LogisticRegression&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;sklearn.svm&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SVC&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;sklearn.preprocessing&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;StandardScaler&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;sklearn.model_selection&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;GridSearchCV&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;StratifiedKFold&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;pipeline&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Pipeline&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'scale'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; 
    &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'predict'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Set up some parameters
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;params&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;'scale'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:[&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;StandardScaler&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()],&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;'predict'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;LogisticRegression&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()],&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;'predict__C'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;0.1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;100&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;'scale'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:[&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;StandardScaler&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()],&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;'predict'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SVC&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()],&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;'predict__C'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;0.1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;100&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;gs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;GridSearchCV&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pipeline&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;params&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;scoring&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'accuracy'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cv&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;StratifiedKFold&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;return_train_score&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;gs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;X&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;best_estimator_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We define some parameters, define the algorithms we want to use, and search every combination of the parameters.&lt;/p&gt;

&lt;p&gt;We have also introduced &lt;em&gt;scalers&lt;/em&gt;, to scale the data.&lt;/p&gt;

&lt;p&gt;This code will score each one based on accuracy (how many predictions were correct) and print the most successful combination of parameters.&lt;/p&gt;

&lt;h4 id=&quot;scaling-data&quot;&gt;Scaling data&lt;/h4&gt;

&lt;p&gt;Many machine learning algorithms require/prefer data to be scaled - i.e. all data in the same range. Scikit-Learn includes a bunch of scalers, and you can try out what works. I ended up using a MinMaxScaler, scaling all my data to between 0 and 1. This is something to play around with (and your experimentation can be automated, as above).&lt;/p&gt;
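
&lt;p&gt;To make that concrete, here is a small sketch of what a MinMaxScaler does to two invented features on very different scales:&lt;/p&gt;

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two hypothetical features: a small count and a large monetary amount
X_toy = np.array([[1.0, 10000.0], [2.0, 50000.0], [5.0, 30000.0]])

# Each column is rescaled independently so its min maps to 0 and its max to 1
scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X_toy)
print(scaled)
```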

&lt;h4 id=&quot;scoring&quot;&gt;Scoring&lt;/h4&gt;

&lt;p&gt;Scoring by accuracy can be a bad idea - and in my case, it was. The percentage of people who default on loans is pretty small. Imagine it was 5% (not the real value in my data). Then any algorithm which predicts no one to default will score 95% accuracy.&lt;/p&gt;

&lt;p&gt;This is exactly what happened to me - very high accuracy, absolutely useless result.&lt;/p&gt;
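
&lt;p&gt;The trap is easy to reproduce. The sketch below uses invented labels with the 5% default rate imagined above, and Scikit-Learn’s &lt;code&gt;DummyClassifier&lt;/code&gt; as a stand-in for a model that never predicts a default:&lt;/p&gt;

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 5 defaulters out of 100 customers
y_toy = np.array([1] * 5 + [0] * 95)
X_toy = np.zeros((100, 1))  # features are irrelevant for this baseline

# A 'model' that always predicts the most common class (no default)
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
y_hat = baseline.predict(X_toy)

accuracy = accuracy_score(y_toy, y_hat)
recall = recall_score(y_toy, y_hat)
print(accuracy, recall)
```

&lt;p&gt;Accuracy comes out at 95% while not a single defaulter is caught.&lt;/p&gt;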

&lt;p&gt;Alternative scoring methods are available, and some are more useful when you have very few positive results:&lt;/p&gt;

&lt;h4 id=&quot;precision-and-recall&quot;&gt;Precision and recall&lt;/h4&gt;

&lt;p&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Precision_and_recall&quot;&gt;Precision and recall&lt;/a&gt; are alternative ways of evaluating models:&lt;/p&gt;

&lt;p&gt;precision = true positives / (true positives + false positives)&lt;br /&gt;
Conceptually, this means - of the customers I said would default, how many actually did?&lt;/p&gt;

&lt;p&gt;recall = true positives / (true positives + false negatives)&lt;br /&gt;
Conceptually, this means - of the customers who defaulted, how many did I correctly identify?&lt;/p&gt;

&lt;p&gt;Example:&lt;br /&gt;
35 people default.&lt;br /&gt;
You predict 50 will default, of which 25 are correctly identified and 25 are falsely accused. You also miss 10 defaulters, who you mark as good customers.&lt;br /&gt;
Precision = 25 / (25 + 25) = 50%&lt;br /&gt;
Recall = 25 / (25 + 10) = 71%&lt;/p&gt;
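
&lt;p&gt;The same example can be checked with Scikit-Learn’s metric functions. The arrays below are built to match the counts above; the 40 true negatives are an arbitrary addition of mine, since they do not affect either score:&lt;/p&gt;

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# 25 true positives, 25 false positives, 10 false negatives,
# plus an arbitrary 40 true negatives (not part of the example)
y_true = np.array([1] * 25 + [0] * 25 + [1] * 10 + [0] * 40)
y_pred = np.array([1] * 25 + [1] * 25 + [0] * 10 + [0] * 40)

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
print(round(precision, 2), round(recall, 2))
```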

&lt;p&gt;For further reading: &lt;a href=&quot;https://towardsdatascience.com/precision-vs-recall-386cf9f89488&quot;&gt;Towards Data Science blog&lt;/a&gt;&lt;/p&gt;

&lt;h4 id=&quot;f1-score&quot;&gt;&lt;a href=&quot;https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html&quot;&gt;F1 score&lt;/a&gt;&lt;/h4&gt;

&lt;p&gt;The F1 score is the harmonic mean of precision and recall, so it only scores highly when both are high. That makes it very useful in this case - and indeed in any case where your classes are imbalanced.&lt;/p&gt;
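
&lt;p&gt;As a quick sketch, F1 = 2PR / (P + R), where P is precision and R is recall. Using the counts from the precision and recall example above (again with an arbitrary 40 true negatives added by me), the hand calculation and &lt;code&gt;f1_score&lt;/code&gt; should agree:&lt;/p&gt;

```python
import numpy as np
from sklearn.metrics import f1_score

# 25 true positives, 25 false positives, 10 false negatives, 40 true negatives
y_true = np.array([1] * 25 + [0] * 25 + [1] * 10 + [0] * 40)
y_pred = np.array([1] * 25 + [1] * 25 + [0] * 10 + [0] * 40)

# By hand: harmonic mean of precision and recall
precision = 25 / (25 + 25)
recall = 25 / (25 + 10)
f1_by_hand = 2 * precision * recall / (precision + recall)

# Same thing via Scikit-Learn
f1 = f1_score(y_true, y_pred)
print(round(f1_by_hand, 3), round(f1, 3))
```

&lt;p&gt;GridSearchCV also accepts &lt;code&gt;scoring='f1'&lt;/code&gt;, so swapping that in for &lt;code&gt;scoring='accuracy'&lt;/code&gt; is one way to optimise for it.&lt;/p&gt;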

&lt;h4 id=&quot;confusion-matrix&quot;&gt;Confusion matrix&lt;/h4&gt;

&lt;p&gt;A confusion matrix can be a really great way of visualising how your model is doing.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://www.dglencross.com/assets/images/confusion_matrix.png&quot; alt=&quot;Confusion matrix&quot; /&gt;&lt;/p&gt;

&lt;p&gt;When using a confusion matrix, your aim is to maximise the numbers in the top left and bottom right - these are your true negatives and true positives. The bottom left and top right represent mis-categorised customers, so should be minimised.&lt;/p&gt;

&lt;p&gt;In our case, we start by assuming no one defaults (before starting this project). Therefore all our customers are on the left - the right side of the matrix contains no one.&lt;/p&gt;

&lt;p&gt;Our aim is to move as many people as possible from the bottom left (people who you say are good customers but actually default) to the bottom right (number of people you correctly identify as being defaulters) without increasing the number in the top right (number of people you accuse of being defaulters who are not).&lt;/p&gt;

&lt;p&gt;You can get a text confusion matrix by comparing the true labels of your test data with your predictions for it:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;sklearn.metrics&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;confusion_matrix&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;matrix&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;confusion_matrix&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y_test&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y_pred&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;print&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;matrix&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;You can then draw it as an image, as I have done, like this:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;plt&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;numpy&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;seaborn&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sns&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;sns&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;df_cm&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;DataFrame&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;matrix&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;index&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;01&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
                  &lt;span class=&quot;n&quot;&gt;columns&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;01&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;plt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;figure&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;figsize&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;sns&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;heatmap&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df_cm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;annot&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;fmt&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'g'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cmap&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;PiYG&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;center&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h4 id=&quot;train_test_split&quot;&gt;train_test_split&lt;/h4&gt;

&lt;p&gt;To prove what you’re doing is right, you need to split your data into a training section and a test section. You ‘fit’ the algorithm using the training data, and test how well it works using the test data.&lt;/p&gt;

&lt;p&gt;Anything else is cheating - you can get very good results by testing against the whole data set, but you cannot learn anything from it - it will not be a good predictor of future unseen data.&lt;/p&gt;

&lt;p&gt;Scikit-Learn has a ‘train_test_split’ function which will divide up your data for you (see below).&lt;/p&gt;
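
&lt;p&gt;With rare positives like loan defaults, it can also help to pass &lt;code&gt;stratify&lt;/code&gt; so both halves keep the same proportion of defaulters. This is an optional extra rather than something I used; the data below is invented:&lt;/p&gt;

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 10 defaulters in 100 rows
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 10 + [0] * 90)

# stratify keeps the proportion of defaulters the same in both splits
x_train, x_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=27
)
print(y_train.sum(), y_test.sum())
```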

&lt;h2 id=&quot;putting-it-all-together&quot;&gt;Putting it all together&lt;/h2&gt;

&lt;p&gt;Once I’d decided on my algorithm with the most success - a &lt;a href=&quot;https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html&quot;&gt;RandomForestClassifier&lt;/a&gt; - I put it all together:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;sklearn.impute&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SimpleImputer&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;sklearn.pipeline&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Pipeline&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;sklearn.ensemble&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;RandomForestClassifier&lt;/span&gt;
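&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;sklearn.model_selection&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;train_test_split&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;sklearn.preprocessing&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;MinMaxScaler&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;numpy&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;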

&lt;span class=&quot;c1&quot;&gt;# Now do the rest
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Defaulted&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;values&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;X&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;features&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;columns&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;X&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;columns&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;index&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;X&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;index&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;classifier&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;RandomForestClassifier&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n_estimators&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;pipeline&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Pipeline&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'impute'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SimpleImputer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;missing_values&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nan&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;strategy&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'constant'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;fill_value&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)),&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'scale'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;MinMaxScaler&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;feature_range&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))),&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'predict'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;classifier&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;x_train&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x_test&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y_train&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y_test&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;train_test_split&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;X&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;test_size&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;random_state&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;27&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;pipeline&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x_train&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y_train&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# make predictions against test data
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y_pred&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pipeline&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;predict&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x_test&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;So I’ve told Scikit-Learn to impute (fill) missing values with 0, scale all my data to between 0 and 1, and run a RandomForestClassifier with 80% of the data as my training set.&lt;/p&gt;

&lt;p&gt;You can then plot a confusion matrix and see how well you’ve done.&lt;/p&gt;
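
&lt;p&gt;To make it concrete what a confusion matrix counts, here is a minimal, dependency-free sketch. The labels are made up for illustration (I’m assuming 1 means ‘defaulted’ and 0 means ‘did not’); with the real pipeline you would use y_test and y_pred instead, and Scikit-Learn’s own confusion_matrix function in sklearn.metrics should give you the same counts.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;from collections import Counter

# Hypothetical labels standing in for y_test and y_pred (assumed: 1 = default, 0 = no default)
y_test_example = [0, 0, 1, 1, 1, 0]
y_pred_example = [0, 1, 1, 1, 0, 0]

# Count each (actual, predicted) pair
counts = Counter(zip(y_test_example, y_pred_example))

# Rows are the actual class, columns are the predicted class
matrix = [[counts[(actual, predicted)] for predicted in (0, 1)] for actual in (0, 1)]

for row in matrix:
    print(row)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The top-left and bottom-right numbers are the predictions that were right; the other two are the mistakes.&lt;/p&gt;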

&lt;h4 id=&quot;prediction-confidence-levels&quot;&gt;Prediction confidence levels&lt;/h4&gt;

&lt;p&gt;Alternatively, you can ask for class probabilities instead of hard predictions:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# get class probabilities for the test data
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y_proba&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pipeline&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;predict_proba&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x_test&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Instead of a single label, this gives you the algorithm’s confidence in each class. Each ‘prediction’ is a pair of estimates for the negative and positive result, e.g. [0.2, 0.8], and each pair always adds up to 1.&lt;/p&gt;

&lt;p&gt;If it’s important to be very confident, you could filter out everything under (for example) 80% confidence.&lt;/p&gt;
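
&lt;p&gt;As a rough sketch of that filtering, assuming a made-up set of probability pairs and an 80% cut-off (both are illustrative, not values from my project):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Hypothetical predict_proba output: one [negative, positive] pair per row.
# With the real pipeline you could use pipeline.predict_proba(x_test).tolist()
y_proba_example = [[0.2, 0.8], [0.55, 0.45], [0.05, 0.95], [0.7, 0.3]]

threshold = 0.8  # assumed cut-off, tune it for your own problem

confident = []
for index, probabilities in enumerate(y_proba_example):
    confidence = max(probabilities)
    if confidence &amp;gt;= threshold:
        # keep the row index, the winning class and how confident the model was
        confident.append((index, probabilities.index(confidence), confidence))

print(confident)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Anything that doesn’t make the cut is left without a prediction, so you trade coverage for confidence.&lt;/p&gt;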

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;So that’s it - in 2 days I put together a fairly good estimator for whether someone would default on their loan or not, using the above code.&lt;/p&gt;

&lt;p&gt;It doesn’t really require a huge understanding of machine learning, just the basics.&lt;/p&gt;

&lt;p&gt;In summary, your key steps:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Massage data, remove nulls etc&lt;/li&gt;
  &lt;li&gt;Scale your data&lt;/li&gt;
  &lt;li&gt;Separate out training and test data&lt;/li&gt;
  &lt;li&gt;Fit model to training data&lt;/li&gt;
  &lt;li&gt;Test model against test data&lt;/li&gt;
  &lt;li&gt;Evaluate your results&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And that’s it - from a basic understanding of machine learning, but with no experience of using it, you should be able to hack your way through a simple project like this.&lt;/p&gt;</content><author><name>Dave Glencross</name><email>dglencross@gmail.com</email><uri>https://dglencross.com</uri></author><category term="machine learning" /><category term="machine learning" /><category term="scikit learn" /><category term="software development" /><category term="programming" /><category term="python" /><summary type="html">Or: how to fake your way through machine learning</summary></entry><entry><title type="html">Book Review: “Architects of Intelligence: The truth about AI from the people building it” by Martin Ford</title><link href="https://dglencross.com/book%20review/review-architects-of-intelligence/" rel="alternate" type="text/html" title="Book Review: “Architects of Intelligence: The truth about AI from the people building it” by Martin Ford" /><published>2019-02-05T00:00:00+00:00</published><updated>2019-02-05T00:00:00+00:00</updated><id>https://dglencross.com/book%20review/review-architects-of-intelligence</id><content type="html" xml:base="https://dglencross.com/book%20review/review-architects-of-intelligence/">&lt;h2 id=&quot;tldr-of-book&quot;&gt;TLDR of book&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Interviews with big names in the world of artificial intelligence&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Most of them are very sceptical about ‘artificial general intelligence’ emerging any time soon&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;No one’s too bothered about Nick Bostrom’s worries written about in ‘Superintelligence’&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;mildly-interesting-very-long&quot;&gt;Mildly interesting, very long&lt;/h2&gt;

&lt;p&gt;Sometimes I read a really highly-rated book, do not get on with it at all, and wonder if I completely missed the point. A recent example is “The Master and Margarita” by Mikhail Bulgakov, supposedly one of the best Russian novels of the twentieth century. I struggled. Afterwards, I looked up summaries of it and discovered that yes, I had completely missed the point. It’s full of symbols from Freemasonry. It is apparently a ‘response to aggressive atheistic propaganda’. I missed all this.&lt;/p&gt;

&lt;p&gt;‘Architects of Intelligence’ is very highly rated on Goodreads, about the same as ‘The Master and Margarita’, and much like that, I struggled.&lt;/p&gt;

&lt;p&gt;As a software developer, and someone with an interest in machine learning (not a &lt;em&gt;huge&lt;/em&gt; interest, but an interest), I thought this would be really good. I enjoyed Nick Bostrom’s ‘Superintelligence’, although bits of it were a bit heavy going. He is, in fact, one of those interviewed in this book.&lt;/p&gt;

&lt;p&gt;My key issue with this book is that the author asks all the interviewees the same questions. Alright, that is what he set out to do - he explicitly says so in the intro. And I do get it - it means you hear what all these different experts think on the same topics, and you can see the differences in their opinions.&lt;/p&gt;

&lt;p&gt;I am not so interested in AI that I actually had any idea who most of these people were. A friend who works in this area was quite impressed by the list of interviewees - when I showed him the table of contents, he said the book started strong (in terms of the interviewees) and got weaker as it went on. For me though, I had only heard of a very small number of these - Nick Bostrom, Ray Kurzweil. That’s about it.&lt;/p&gt;

&lt;p&gt;So for me, it was a lot of people I’d never heard of, giving largely similar answers to the same questions. Ford emphasises how different the opinions are on the question of when Artificial General Intelligence will emerge. I got the sense, though, that there are so many unsolved problems to figure out before AGI can emerge that no one really had any idea. Plus, most of them didn’t even want to make a prediction.&lt;/p&gt;

&lt;h2 id=&quot;good-bits&quot;&gt;Good bits&lt;/h2&gt;

&lt;p&gt;Despite my being very negative, there were some interesting ideas:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;every interviewee asked about self-driving cars was much more pessimistic about them than I would have thought. My knowledge is solely based on news/tech articles about self-driving cars, and from that I got the impression that they were a few years away. The interviewees as a group were much more likely to predict at least 10 years. That doesn’t bode well for Uber.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;no one was impressed by the Turing test, which in itself is not a surprise. But one interviewee spoke about the ‘coffee test’. A robot is placed in front of a normal house, one that it has never seen before, and has to go in and make a cup of coffee. The robot would have to find the kitchen, find all the coffee-making apparatus etc.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Nick Bostrom spoke about the risk of amateurs flying drones across airports - something &lt;a href=&quot;https://en.wikipedia.org/wiki/Gatwick_Airport_drone_incident&quot;&gt;we’ve seen recently&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;I feel a bit guilty saying I didn’t enjoy this book, and I wonder whether it’s just not aimed at me. But I think I could have learned the same information from survey results. Most interviews were too similar to be interesting and the whole book dragged for me. Though based on the Goodreads reviews, this is not a popular opinion. Unless you are particularly fascinated by AI and have heard of some of these people - I could not recommend this book.&lt;/p&gt;</content><author><name>Dave Glencross</name><email>dglencross@gmail.com</email><uri>https://dglencross.com</uri></author><category term="book review" /><category term="book review" /><category term="artificial intelligence" /><category term="machine learning" /><summary type="html">TLDR of book</summary></entry><entry><title type="html">Running SQL Files from other SQL Files with MSSQL</title><link href="https://dglencross.com/mssql/mssql-call-other-sql-files/" rel="alternate" type="text/html" title="Running SQL Files from other SQL Files with MSSQL" /><published>2019-01-17T00:00:00+00:00</published><updated>2019-01-17T00:00:00+00:00</updated><id>https://dglencross.com/mssql/mssql-call-other-sql-files</id><content type="html" xml:base="https://dglencross.com/mssql/mssql-call-other-sql-files/">&lt;p&gt;During &lt;a href=&quot;https://dglencross.com/testing/integration-testing-with-roundhouse/&quot;&gt;my work with RoundhousE&lt;/a&gt;, I wanted to ensure the SQL scripts ran in the order I wanted. Foreign keys referenced things which had to exist first (for example an account has to belong to a customer, so the customer has to be created first).&lt;/p&gt;

&lt;p&gt;RoundhousE, within a given folder, will just run scripts in alphabetical order. So you could name them “0001….sql”, “0002….sql” but what happens if you have 100 scripts and you want to insert one in the middle? Then renaming becomes a pain. Instead, we write one ‘master’ SQL file which will call the others (note - RoundhousE will only load SQL scripts, which is why I had to do it like this).&lt;/p&gt;

&lt;p&gt;The data scripts I had generated were just lots of “insert …” statements, so nothing complicated. But this was a slightly more difficult problem to solve than I first expected.&lt;/p&gt;

&lt;h2 id=&quot;solution&quot;&gt;Solution&lt;/h2&gt;

&lt;p&gt;Essentially we want to read in the file as a string, and then execute it as dynamic SQL.&lt;/p&gt;

&lt;h3 id=&quot;loading-the-file&quot;&gt;Loading the file&lt;/h3&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;DECLARE&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;FileContents&lt;/span&gt;  &lt;span class=&quot;n&quot;&gt;NVARCHAR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;MAX&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;


&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;FileContents&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;BulkColumn&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt;   &lt;span class=&quot;n&quot;&gt;OPENROWSET&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;BULK&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'C:&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\t&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;emp&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\m&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;yfile.sql'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SINGLE_NCLOB&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;A CLOB is a Character Large Object in SQL, basically a huge list of characters. NCLOB is the same but in nvarchar format.&lt;/p&gt;

&lt;p&gt;You might now see this message:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Error Number: 4809, Message: SINGLE_NCLOB requires a UNICODE (widechar) input file. The file specified is not Unicode.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;With SINGLE_NCLOB, OPENROWSET expects the file to be Unicode, specifically UTF-16. I use Notepad++, so I clicked Encoding -&amp;gt; ‘Encode in UCS-2 BE BOM’ (which is essentially equivalent to UTF-16).&lt;/p&gt;

&lt;p&gt;With that done, the file loads successfully.&lt;/p&gt;

&lt;p&gt;Now we run it:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;exec&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sp_executesql&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;FileContents&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And the SQL in the other file (myfile.sql) should now run.&lt;/p&gt;

&lt;p&gt;You can add multiple commands like this to run any SQL file you want (provided the encoding is correct!).&lt;/p&gt;

&lt;p&gt;I thought this problem would be way easier than it turned out to be, so writing it up in case anyone else comes across this.&lt;/p&gt;</content><author><name>Dave Glencross</name><email>dglencross@gmail.com</email><uri>https://dglencross.com</uri></author><category term="mssql" /><category term="software development" /><category term="programming" /><category term="sql" /><category term="mssql" /><summary type="html">During my work with RoundhousE, I wanted to ensure the SQL scripts ran in the order I wanted. Foreign keys referenced things which had to exist first (for example an account has to belong to a customer, so the customer has to be created first).</summary></entry><entry><title type="html">Integration testing with C# and RoundhousE</title><link href="https://dglencross.com/testing/integration-testing-with-RoundhousE/" rel="alternate" type="text/html" title="Integration testing with C# and RoundhousE" /><published>2019-01-16T00:00:00+00:00</published><updated>2019-01-16T00:00:00+00:00</updated><id>https://dglencross.com/testing/integration-testing-with-RoundhousE</id><content type="html" xml:base="https://dglencross.com/testing/integration-testing-with-RoundhousE/">&lt;p&gt;Recently at work, I was tasked with creating some integration tests from scratch for a project which only had unit tests. The current way we run integration tests on other projects was too slow (adding data via Entity Framework) so they wanted a new approach. A colleague suggested RoundhousE.&lt;/p&gt;

&lt;h2 id=&quot;roundhouse&quot;&gt;RoundhousE&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/chucknorris/roundhouse&quot;&gt;https://github.com/chucknorris/roundhouse&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;RoundhousE (that E is really annoying me) is, in their own words, “a database migrations engine that uses plain old SQL Scripts to transition a database from one version to another”. It has versioning so you can easily tell whether a database needs upgrading, or roll back to a previous version if necessary (provided you’ve set it up properly).&lt;/p&gt;

&lt;p&gt;However, all I want it to do is set up local databases for integration testing, then delete them afterwards.&lt;/p&gt;

&lt;h2 id=&quot;creating-a-database&quot;&gt;Creating a database&lt;/h2&gt;

&lt;p&gt;Firstly, wherever you’re running this should have MS SQL Server installed and running.&lt;/p&gt;

&lt;p&gt;RoundhousE packages rh.exe, which when called will do the following:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;create a database if it doesn’t already exist (wherever you tell it to)&lt;/li&gt;
  &lt;li&gt;in a pre-determined order, run through folders looking for SQL scripts and run any it finds (check out &lt;a href=&quot;https://github.com/chucknorris/roundhouse/wiki/RoundhousE-Script-Order&quot;&gt;https://github.com/chucknorris/roundhouse/wiki/RoundhousE-Script-Order&lt;/a&gt; for the ordering)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Easy right? I needed 2 databases for my integration tests, so I ran through the process twice.&lt;/p&gt;

&lt;p&gt;This is the bat script for creating a database:&lt;/p&gt;

&lt;div class=&quot;language-bat highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;    &lt;span class=&quot;kd&quot;&gt;SET&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;DIR&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;vm&quot;&gt;%~d0%~p0&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;%&lt;/span&gt;

    &lt;span class=&quot;kd&quot;&gt;SET&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;database&lt;/span&gt;.name&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;%&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;
    &lt;span class=&quot;kd&quot;&gt;SET&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;sql&lt;/span&gt;.files.directory&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;%DIR%&lt;/span&gt;&lt;span class=&quot;kd&quot;&gt;db&lt;/span&gt;\&lt;span class=&quot;err&quot;&gt;%&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;
    &lt;span class=&quot;kd&quot;&gt;SET&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;server&lt;/span&gt;.database&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;(local)&quot;&lt;/span&gt;

    :: &lt;span class=&quot;kd&quot;&gt;roundhouse&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;call&lt;/span&gt;
    &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;%DIR%&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;console\rh.exe&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;/d&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;vm&quot;&gt;%database&lt;/span&gt;.name&lt;span class=&quot;err&quot;&gt;%&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;/f&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;vm&quot;&gt;%sql&lt;/span&gt;.files.directory&lt;span class=&quot;err&quot;&gt;%&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;/s&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;vm&quot;&gt;%server&lt;/span&gt;.database&lt;span class=&quot;err&quot;&gt;%&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;/silent /drop
    &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;%DIR%&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;console\rh.exe&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;/d&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;vm&quot;&gt;%database&lt;/span&gt;.name&lt;span class=&quot;err&quot;&gt;%&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;/f&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;vm&quot;&gt;%sql&lt;/span&gt;.files.directory&lt;span class=&quot;err&quot;&gt;%&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;/s&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;vm&quot;&gt;%server&lt;/span&gt;.database&lt;span class=&quot;err&quot;&gt;%&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;/silent /simple
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Note that I am dropping the database first, just in case it is still hanging around (for example, if the last run failed before tear down).&lt;/p&gt;

&lt;h2 id=&quot;generating-data-scripts&quot;&gt;Generating data scripts&lt;/h2&gt;

&lt;p&gt;I’m using Microsoft SQL Server, so I used its functionality to generate scripts for creating the tables. One of my databases was very small, so I dumped the entirety of the data into scripts for RoundhousE to use.&lt;/p&gt;

&lt;p&gt;The other database - our main database at work - was much too large to dump the data. Technically it would work, but the tests would be too slow to set up. I dumped individual tables of static data and created some scripts of specific user data. This was quite slow - I need a better tool to do this. After my proof of concept is finished, I will try to find something better.&lt;/p&gt;

&lt;h2 id=&quot;ignoring-versioning&quot;&gt;Ignoring versioning&lt;/h2&gt;

&lt;p&gt;Part of the appeal of RoundhousE is its versioning tools. If you try to deploy the same version number to a database that you already deployed there, and any of the files have changed, RoundhousE will complain. Equally, if it detects the files haven’t changed, it will assume it doesn’t need to update.&lt;/p&gt;

&lt;p&gt;In my use case, I just want it to deploy afresh each time. RoundhousE stores this versioning info in tables in the database, so before each test, I drop the databases (if they don’t exist then it just carries on). This way it redeploys every time and I pay no attention to RoundhousE’s versioning tools.&lt;/p&gt;

&lt;h2 id=&quot;piecing-it-all-together&quot;&gt;Piecing it all together&lt;/h2&gt;

&lt;p&gt;I used xUnit to run my tests. For this, you define a collection backed by a ‘database fixture’ like so:&lt;/p&gt;

&lt;div class=&quot;language-csharp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;    &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;CollectionDefinition&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Database collection&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;IntegrationTestBase&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ICollectionFixture&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;DatabaseFixture&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;// deliberately left empty&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We define our DatabaseFixture class which is responsible for the creation and dropping of databases. In the constructor, I call the bat file for creating the database (one call for each database).&lt;/p&gt;

&lt;p&gt;DatabaseFixture will implement IDisposable, so in your Dispose() method you can add calls to a bat file to drop these databases (this is assuming you want them dropped after each run).&lt;/p&gt;

&lt;p&gt;For any test class, we annotate it with:&lt;/p&gt;

&lt;div class=&quot;language-csharp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;    &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;Collection&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Database collection&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This way xUnit treats them as part of the same collection and will run the creation/deletion once.&lt;/p&gt;

&lt;p&gt;End result - all working.&lt;/p&gt;

&lt;h2 id=&quot;using-roundhouse-to-run-tests-in-parallel&quot;&gt;Using RoundhousE to run tests in parallel&lt;/h2&gt;

&lt;p&gt;If you want to run tests in parallel, just decouple the database setup and tear down from the tests running. Say you have 4 processes all running different tests, you would create the database, run each process, and once they’ve all finished, drop the database again. I did play around with creating different databases for each test run and got it working - but it was a bit messy and ultimately pointless.&lt;/p&gt;</content><author><name>Dave Glencross</name><email>dglencross@gmail.com</email><uri>https://dglencross.com</uri></author><category term="testing" /><category term="testing" /><category term="integration testing" /><category term="software development" /><category term="programming" /><category term="C#" /><summary type="html">Recently at work, I was tasked with creating some integration tests from scratch for a project which only had unit tests. The current way we run integration tests on other projects was too slow (adding data via Entity Framework) so they wanted a new approach. A colleague suggested RoundhousE.</summary></entry><entry><title type="html">Book Review: “Developer Hegemony: The Future of Labor” by Eric Dietrich</title><link href="https://dglencross.com/book%20review/review-developer-hegemony/" rel="alternate" type="text/html" title="Book Review: “Developer Hegemony: The Future of Labor” by Eric Dietrich" /><published>2019-01-08T00:00:00+00:00</published><updated>2019-01-08T00:00:00+00:00</updated><id>https://dglencross.com/book%20review/review-developer-hegemony</id><content type="html" xml:base="https://dglencross.com/book%20review/review-developer-hegemony/">&lt;h2 id=&quot;tldr-of-book&quot;&gt;TLDR of book&lt;/h2&gt;
&lt;ul&gt;
  &lt;li&gt;There are more software development jobs than there are developers - demand outweighs supply&lt;/li&gt;
  &lt;li&gt;Being a programmer in a corporation is a terrible financial deal, and the path to climbing the ladder through hard work is too slow to achieve within your lifetime. Corporations are not designed properly for software developers&lt;/li&gt;
  &lt;li&gt;As a developer, the only way to get ahead in a corporation is abandoning programming and becoming an opportunist - but this means sacrificing your ethics&lt;/li&gt;
  &lt;li&gt;Developers should quit corporations and create their own “efficiencer” firms, in which companies come to them with problems and these small firms design the solutions and create the software&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;highlights&quot;&gt;Highlights&lt;/h2&gt;
&lt;ul&gt;
  &lt;li&gt;Opening chapters - fictional description of how the work of the future might look&lt;/li&gt;
  &lt;li&gt;Talking about ridiculous (but widespread) hiring/performance review/compensation practices&lt;/li&gt;
  &lt;li&gt;How to get ahead as an opportunist/sociopath in the corporate world (though why you shouldn’t)&lt;/li&gt;
  &lt;li&gt;A vision for the future of work, and advice about how you can get there&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
  &lt;p&gt;“Programming is not a calling, and it’s not a craft. It’s just automation that increases top line revenue through product or reduces bottom line costs through efficiency.” 
from chapter 32, titled “The Programmer’s Escape Plan”.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This blunt statement is one of a number of examples in the book about the future of development work that made me reflect on my own career in software. I’ve never really been one to obsess over ‘the craft’ of development, but I am sure that sentence would annoy a number of people I have met.&lt;/p&gt;

&lt;p&gt;There was plenty of accurate skewering of my own attitudes though. The pragmatist viewpoint is summarised by someone I work with - his internal profile on an app we use says something along the lines of “I have a real passion for staying alive and earning money allows me to accommodate that.” Pragmatists, according to Dietrich, accept the status quo and justify their lack of ambition by saying they don’t care about the job anyway:&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;“Pragmatists have their real thing that they care about, and it isn’t the job they’re doing. This is an entirely rational means of ego salve, similar to a teenager making a big to-do over how he doesn’t “believe in” the prom because of some philosophy or another that he’s adopted. Getting dates isn’t easy and the attempt may mean embarrassment, so it’s a lot safer to create a choice narrative around not trying in the first place.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I.e. pretend not to care, because it’s easier than accepting that you aren’t going to make a difference in your company. Not that Dietrich accuses ‘pragmatists’ of being lazy, but the descriptions were fairly spot on in places and made me laugh in self-recognition.&lt;/p&gt;

&lt;h2 id=&quot;getting-ahead-in-the-corporate-world&quot;&gt;Getting ahead in the corporate world&lt;/h2&gt;

&lt;p&gt;Although Dietrich is at pains to state that he does not endorse any of the methods he gives for corporate advancement, the section where he talks about how to become an ‘opportunist’ within a corporation is fascinating in a morbid kind of way.&lt;/p&gt;

&lt;p&gt;A memorable example he gives is this: say you’ve taken over management of a team, and you realise they aren’t going to deliver a project on time. You have to spin a narrative that makes you look good, so you decide that the narrative will be that this team is overcommitting and underdelivering, but you will sort the problem out afterwards. To really make the point, you identify a team member who vocally commits to getting work done, talk to them privately, and encourage them to keep up the behaviour. Now this team member is even more vocal, and you can make the point to superiors that this problem already existed in the team, you didn’t identify it quickly enough, but you will sort it out - rather than the alternative narrative that you just couldn’t organise a team. There’s no evidence of your conversation with the team member - but plenty of evidence of them committing to too much work in meetings.&lt;/p&gt;

&lt;p&gt;This example (definitely not advice!) is fairly immoral but I found this whole section to be really interesting, and in this case - pretty funny.&lt;/p&gt;

&lt;p&gt;In fact, some of the most entertaining sections are Dietrich describing what is so wrong with the current state of work for software developers. I won’t go into much detail but you can tell just from some of the chapter names:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;“Interviews, Induction, and Nonsense”
“Performance Reviews and Advancement Theater”
“Your Company Doesn’t Care About You”
“The Madness Of It All”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In the chapter about interviewing:&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;“the job interview is a process that was dreamed up on a whim about a century ago, never worked in the first place, and hasn’t been altered since. Unsurprisingly, they haven’t magically started working.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Dietrich does acknowledge that he still interviews people, because there isn’t (yet) an ideal alternative (although one is given in the ideal future work scenario that he envisages).&lt;/p&gt;

&lt;p&gt;Any developer will have experienced the kinds of nonsense he describes in these chapters, and I took some definite pleasure in him being so scathing about it.&lt;/p&gt;

&lt;h3 id=&quot;sidenote--interviews&quot;&gt;Sidenote : interviews&lt;/h3&gt;

&lt;p&gt;Personally, I think the current best way of doing this is to do real work with an interviewee - get them to do a project for 2-3 hours and then come in and talk it through, possibly with a bit of pair programming between an employee and the interviewee. However, this is really time-consuming for a candidate who has to do this for even a moderate number of companies. Dietrich says he follows a similar process.&lt;/p&gt;

&lt;h2 id=&quot;a-corporation-of-one&quot;&gt;A Corporation of One&lt;/h2&gt;

&lt;p&gt;One of the most interesting bits of advice is to stop thinking of yourself as an employee of a particular company. Opportunists, Dietrich says, think of themselves as “single-person corporate entities unto themselves”:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;“This may seem like a silly mental exercise at first blush, but its impact is profound. If you’re a software engineer, your boss asking you to put in a few fifty-hour weeks ahead of the upcoming release is perfectly reasonable, if something of a bummer. But if you’re a free agent, your client asking you to work ten hours a week for free is perfectly preposterous. The difference in thinking is that it’s now reasonable for you to ask, “Why would I do that? What’s in it for me?””&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;He is keen to emphasise that his methods for getting ahead in the corporate world are not advice, simply fact. His real advice is for how to move towards his vision of working in very small firms which are essentially contractors.&lt;/p&gt;

&lt;p&gt;He gives practical advice about why you should promote yourself and how to do it &lt;em&gt;cough blog cough&lt;/em&gt;, along with more dramatic advice, such as leaving your current job if you are a programmer, because you are already pigeon-holed and the only way to escape that is to escape the company.&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;This book was a great mix of scepticism about the current environment and positivity about the future. Dietrich’s descriptions of future app dev firms (and what they should morph into) really grabbed me and I will be trying to do some of the things he advises to move in that direction.&lt;/p&gt;

&lt;p&gt;I would highly recommend reading this book, especially if you are a software developer.&lt;/p&gt;</content><author><name>Dave Glencross</name><email>dglencross@gmail.com</email><uri>https://dglencross.com</uri></author><category term="book review" /><category term="book review" /><category term="software development" /><category term="programming" /><summary type="html">TLDR of book There are more software development jobs than there are developers - demand outweighs supply Being a programmer in a corporation is a terrible financial deal, and the path to climbing the ladder through hard work is too slow to achieve within your lifetime. Corporations are not designed properly for software developers As a developer, the only way to get ahead in a corporation is abandoning programming and becoming an opportunist - but this means sacrificing your ethics Developers should quit corporations and create their own “efficiencer” firms, in which companies come to them with problems and these small firms design the solutions and create the software</summary></entry></feed>