programming by immersion

Setup

“ In the beginning, there was pie. And it was apple ” — Carl Sagan

First off, install python using the downloads page here. I won't be repeating the instructions here, so come back once you're done doing that. If it's installed and your PATH is set up correctly on your machine, you should be able to type the following into your command line:

python --version

And it should print something out like Python 3.10.8 or similar. Once you've got that working, it's time to install a plotting library because we're going to draw some graphs in a bit. So, check out matplotlib's instructions and run the two commands they have there to install pip and matplotlib. You'll probably be installing matplotlib globally, which I think is dumb, but we're not trying to be enterprise here. You're learning.

Assuming you're done, we can move on. For editing code, you can use notepad if you don't have anything else, or you can use SublimeText, Atom, Vim, NeoVim, Emacs, Notepad++, or VSCode if you want syntax highlighting. Heck, python might even have "IDLE" still and if so, you're free to use that. All you really need is the ability to write and save a plain text file, so go forth, use google, install an IDE, and then come back here to get started.

Our dataset

“ Datadee, Datadum, where's the one who stole my rum? ” — Edward Teach

Initially, I was thinking it would make sense to try to find some sort of sex related data set so that we could come up with funny little quips and such. But, when I went to search data.gov for sex I found that it just finds me datasets by sex rather than anything that relates to people actually having sex.

So there goes the idea of making graphs about virgins and then trying to hunt down if their professions happened to line up with the one you're trying to learn about right now.

But that's ok! We can use a dataset that's more near and dear to everyone's hearts! After all, most people, even across the world outside the USA, generally have a sort of bucket list item of going to "the big apple". So, of course, the most obvious dataset for us to use in our learning is the leading cause of death in that city ¹. So, here's the dataset. Download the CSV to your computer and then we'll get started.

Our first program, loading a file

“ You're gonna wake up and work hard at it ” — Shia LaBeouf

We could get started and I could define what a variable is, what a loop is, functions, and everything else you'd get in an intro lesson. But we're going to go full immersive learning approach and we're going to discuss our data, how we want to use it, and then how to represent what we want to do with code. Starting with a problem, reasoning about what sort of pieces you need, and how you want your program to work is a lot better than getting hung up on rote exercises that only lead to carpal tunnel and frequent naps.

Believe it or not, programming is far less about the language you're using, and more about your thoughts hitting the page. It's certainly about syntax and a host of other things, but at its core, we're trying to solve a problem or make a computer do something for us. So knowing as much as we can about the things that relate to what we want to do is important.

So, you've downloaded the dataset and it's sitting in a file named New_York_City_Leading_Causes_of_Death.csv in your downloads folder. That's a bad place for it, so instead, let's use the command line (or whatever your file explorer is) to move it to the same folder that we'll be programming in. This looks like this:

mkdir learning
cd learning
mv ~/Downloads/New_York_City_Leading_Causes_of_Death.csv .
touch lesson.py

Ok, I did more than I said we would with that last command. The touch command creates an empty file, and in this case it created one called lesson.py. There's nothing in there, and this is actually a valid program. You can tell because if you type in python lesson.py and press enter you'll see that it executes your command and seemingly does nothing at all.

So let's make it do something then so we can feel like we've made some progress. Inside of your lesson file, type this code in:

filename = "New_York_City_Leading_Causes_of_Death.csv"
with open(filename) as file_handle:
    the_first_line_in_the_file = file_handle.readline()
    print(the_first_line_in_the_file)

Then run python lesson.py and you should be greeted by the output:

$ python lesson.py
Year,Leading Cause,Sex,Race Ethnicity,Deaths,Death Rate,Age Adjusted Death Rate

Assuming your python script and the CSV file are in the same folder, and you have python installed properly, then you should see the same. Let's pause here and let me tell you what sort of things you just used and then explain each.

filename is a variable, which is much like your math classes where you defined x = 1 and understood that every other time you referred to x it had the value of 1. This is much the same, except in this particular case we've defined the variable to hold onto a list of letters.
"New_York_City_Leading_Causes_of_Death.csv" this list of letters, or list of characters, is known as a String and is defined by wrapping a bunch of characters in double quotes. There's a lot of stuff that we can do with strings, but we'll see more on that in a minute.
with open(filename) as file_handle As I noted before, much like x = 1 means that you can use x wherever you would use a 1, we use the filename variable as an input argument to the built in function call open. To continue the math analogy, this is like f(x) in your classes, where we're applying a function to the input we defined. In our case, we're opening a file for reading that has the name of whatever String is inside of our filename variable.
the with keyword is used to handle a whole bunch of stuff that you probably don't want to think too much about right now. But essentially, when you open a file up, you also need to remember to close it too. Like when you're done with a Word document you've written for a class or similar, you don't keep it open forever, you close it at some point. This is very similar, except that when we're done using the file, this keyword and its accompanying block of code will handle closing the file for us.
as file_handle is us declaring a variable again. But this time without using the equals sign because this is part of the with magic syntax. Basically, when we open up the file and get the ability to do operations with it (via the handle) we'll be able to do those operations through the variable we just declared.
: the colon and the indentation underneath it is how we define the chunk of code we want to have happen while we've got a valid handle to the file. So anything we want to do with the file will be done inside of this code block. Python is a white space significant language, which means that whitespace matters, for the first timer, this means don't screw up your indentation. Code indented at the same level will run line by line until the flow of the program returns it to a higher, less indented level. More on this later.
the_first_line_in_the_file = file_handle.readline() This line should feel similar to the use of the String at first in that we've got a left side and a right side of an equals sign again. But the right side looks a bit different this time. Instead of a hardcoded value we've stated explicitly, this is using the handle to the file we've opened to fetch out a single line and return that value. Think of a greasy New Yorker in an apron, he's got a big ol paddle that he's using to pull out pizzas from a brick oven. He holds onto that handle, sticks it in, then pulls out one pizza at a time. If a line is a pizza, then file_handle.readline() is us pulling our pizza out of the oven and the variable on the left side is the plate we're putting it down on.
Lastly, we use the built in function print to output the line we just pulled out of the oven to the screen. The variable we inserted the first line of the file we read into is passed as the input argument to the function, and so that's what we print out. I guess to continue the analogy, this is the waiter bringing the pizza to your table and putting it in front of you so you can see it.
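If you want to poke at readline without touching the real dataset, here's a minimal sketch. It assumes a stand-in "file" built with io.StringIO (my addition, not part of the lesson file) that holds the header and one row copied from the dataset. The point is just that each call hands you the next line, and that you get an empty string once there's nothing left.

```python
import io

# Stand-in for the real CSV: an in-memory "file" with two lines from the dataset.
# io.StringIO is only here so the example runs without the download.
fake_file = io.StringIO(
    "Year,Leading Cause,Sex,Race Ethnicity,Deaths,Death Rate,Age Adjusted Death Rate\n"
    "2007,Diabetes Mellitus (E10-E14),M,Other Race/ Ethnicity,11,.,.\n"
)

first_line = fake_file.readline()   # the header line
second_line = fake_file.readline()  # the next line, not the header again
third_line = fake_file.readline()   # nothing left, so this is an empty string

print(first_line.strip())
print(second_line.strip())
print(repr(third_line))
```

So the handle remembers where it is in the file, which is why pulling out "one pizza at a time" works.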

Read the code again, read any of the descriptions above that don't make sense. And then let's pivot to what I said we'd be doing, thinking about what we'll do and how we'll do it.

So you loaded a file, now what?

“ Now what? ” — Steven Hawkins

Now that you've printed out the headers of the CSV, you can see we've got a number of columns we can think about.

For starters, why don't we decide to simply total up the deaths column? If you were doing this in a spreadsheet tool like Excel or Sheets, you could add in a formula, define a range, and then out pops the summary. Since we're working with a simple file, we need to massage it into a form that our program will be able to use. So let's talk about two potential options for how to represent the data.

We know that our data is comma separated, and so a simple way to think about this would be as a list. In python, if we were hard coding our first line as a list of strings it would look like this:

headers = [ "Year", "Leading Cause", "Sex", "Race Ethnicity", "Deaths", "Death Rate", "Age Adjusted Death Rate" ]

This works, but, as a quick thought experiment I want you to tell me where in the list, without looking, the deaths column was located.

If you said the fifth item, congrats! Now, go for a night of drinking with some friends, dance, listen to some loud music, and now come back. Do you remember which number it was? Probably not. Not without counting again.

In code, in order for us to refer to items within a list we use what's called a subscript. It looks like this headers[4]. Subscripting is also commonly referred to as indexing into the list. The index of the list starts at 0 and then goes up one by one for each item. This means that the first thing in the list is subscript 0, and the fifth item is at 4. Confused? The reason we do this has a lot to do with how computers work underneath the hood, and how memory is laid out. Believe it or not, the 0 indexing makes a lot more sense than 1-based (despite what fans of Matlab or Lua will claim). So for now, just accept that first means 0 and we can move on.
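Here's a small sketch of what indexing looks like in practice, using the same headers list from above. The variable names first, fifth, last, and position are just mine for illustration; len and .index are both built into Python.

```python
headers = ["Year", "Leading Cause", "Sex", "Race Ethnicity", "Deaths", "Death Rate", "Age Adjusted Death Rate"]

first = headers[0]                  # the first item lives at index 0
fifth = headers[4]                  # the fifth item lives at index 4
last = headers[len(headers) - 1]    # the last item is at "length minus one"
position = headers.index("Deaths")  # ask the list where a value is instead of counting

print(first)
print(fifth)
print(last)
print(position)
```

Note that .index saves you from counting by hand, but you still end up carrying a number around.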

It is annoying to remember though. So we could keep track of the number we care about in a variable and that's not a terrible idea:

deaths_column = 4
label = headers[deaths_column]

And this will work fine, and reduce what's commonly called "magic numbers" in our code, but there's kind of a better way for us to deal with data and not have to keep track of a bunch of column numbers. We can use a dictionary.

If you look on a shelf for your dictionary, or I suppose navigate to a dictionary website, you can see that there's always a word and then its definition. There's a ton of other stuff (pronunciation and the like), but for the sake of the analogy, ignore that and instead focus on how there's a word (key) and a definition (value). So long as you know which word to look up, you can find the meaning of it by name. Dictionaries are the same sort of thing. They look like this in code:

first_row = {
    "Year": 2007,
    "Leading Cause": "Diabetes Mellitus",
    "Sex": "M", 
    "Race Ethnicity": "Other Race/ Ethnicity", 
    "Deaths": 11,
    "Death Rate": ".", 
    "Age Adjusted Death Rate": "."
}

Where first_row is the name of our variable, and each pair of header name and row value is represented by a key on the left, a colon in the middle, and a value on the right. This particular dictionary uses strings for all of its keys and a mix of strings and numbers for its values, and that kind of mixing is fine. We could use numbers as values, keys, or store lists of data underneath each key, whatever you like really. But most importantly, when we want to get a value out, we can do so like this:

deaths = first_row['Deaths']

That's a lot easier to read, and remember, than a numbered lookup like row[4], don't you think? So, if you're with me, then let's talk about how we convert the file we've opened into a bunch of dictionaries! We're still going to have a list by the way, but rather than a bunch of strings, it's going to contain one dictionary per row of our dataset.
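To see how a list of headers and a list of values relate to a dictionary, here's a rough sketch of building one by hand. The values list is copied from the first data row of the dataset and kept as strings, since that's how text comes out of a file. zip and dict are Python built-ins; this isn't how the rest of the lesson does it, just a way to see the connection.

```python
headers = ["Year", "Leading Cause", "Sex", "Race Ethnicity", "Deaths", "Death Rate", "Age Adjusted Death Rate"]
values = ["2007", "Diabetes Mellitus (E10-E14)", "M", "Other Race/ Ethnicity", "11", ".", "."]

# zip pairs the first header with the first value, the second with the second, and so on.
# dict turns those pairs into key/value entries.
first_row = dict(zip(headers, values))

deaths = first_row["Deaths"]  # look the value up by name, no counting required
print(deaths)
```

Notice that the value you get back is the string "11", not the number 11. That detail matters later.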

Luckily for us, we don't need to figure out how to deal with the funky case you can see in the 3rd row:

Year,Leading Cause,Sex,Race Ethnicity,Deaths,Death Rate,Age Adjusted Death Rate
2007,Diabetes Mellitus (E10-E14),M,Other Race/ Ethnicity,11,.,.
2010,"Diseases of Heart (I00-I09, I11, I13, I20-I51)",F,Not Stated/Unknown,70,.,.
2007,Cerebrovascular Disease (Stroke: I60-I69),M,Black Non-Hispanic,213,25,33
2007,Atherosclerosis (I70),F,Other Race/ Ethnicity,.,.,.

See how there's double quotes around one of the values? That's because it's got a comma inside of the value. So if we were to naively read the line into a variable named line, then do something like line.split(",") to separate the big string into little strings each time a comma showed up, we'd accidentally have the wrong data, since I11, I13, and I20-I51) would all end up as individual strings in our list of words, and that doesn't line up with the actual columns anymore. Luckily for us, python has a built in CSV Reader! Better yet, it supports reading each row into a dictionary for us!
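If you'd like to see the difference for yourself, here's a small sketch comparing the naive split against the csv library on that one funky row. The line is hard coded here so you don't need the file, and the names naive and parsed are just mine. csv.reader accepts any collection of lines, so a one-item list works as a stand-in for a file.

```python
import csv

# The row with the quoted value, copied from the dataset.
line = '2010,"Diseases of Heart (I00-I09, I11, I13, I20-I51)",F,Not Stated/Unknown,70,.,.'

# Naive approach: cut the string at every comma, including the ones inside the quotes.
naive = line.split(",")

# CSV-aware approach: the reader understands that quoted commas belong to the value.
parsed = next(csv.reader([line]))

print(len(naive))   # more pieces than there are columns
print(len(parsed))  # one piece per column
print(parsed[1])
```

The naive version gives you 10 pieces for a 7 column file, which is exactly the misalignment we want to avoid.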

So let's give that a try and then I'll explain what the new things do:

import csv

filename = "New_York_City_Leading_Causes_of_Death.csv"
with open(filename) as file_handle:
    reader = csv.DictReader(file_handle)
    total_deaths = 0
    for row in reader:
        total_deaths += int(row['Deaths'])
    print(f"The total deaths were {total_deaths}")

We've still got the same file opening and print method usage. But there's a funny thing at the top and some new syntax. Let's break it down again:

import csv is an import statement. This is how we say we want to use libraries that are available to us. There's a lot to that, but understand that there's built in functions that are automatically imported, like print, there's built in libraries that we can use just by importing, like csv, and lastly, there's 3rd party libraries that we can install and then import. Assuming you followed the instructions at the top of the page, you actually already did this for matplotlib. What is a library? Think of it like a phone. Your program can open up a bunch of apps and use the features in them to do stuff. The phone is your program, the apps are the libraries, and the person tapping it is you.
reader = csv.DictReader(file_handle) is a variable assignment to the name reader, and we're using the imported csv library to instantiate an instance of a Class. This looks a lot like calling a function, and we are basically doing that, the input to the creation of this thing called a class is the file_handle. For the time being, you can think of a class as a collection of data and functions that can hold onto its own personal copies of data inside of it. The common analogy is a cookie cutter, but let's talk about that later and instead focus on our immersive learning and see how it's used to get a feel for it.
for row in reader: Our first loop! We'll go into more detail in a bit, but this is doing what it sounds like it's doing. For each row in the CSV that's supplied to us via hidden away calls to get the next thing from reader, we're going to do something with the row.
total_deaths += int(row['Deaths']) We talked about dictionaries already, and as the name implies, the DictReader reads each CSV row into a dictionary, and the keys of each dictionary are the headers. So when we say row['Deaths'] we're going to pull out the String value from that column. Strings can be added together, but that just glues them end to end: "a" + "b" gets you "ab". Since we want to treat this data point as a number, we need to convert the data into one. The int function call looking thing is doing just that! Lastly, += is shorthand for "add this variable's current value to whatever I'm giving you and save it". Or, in code: a = a + 1 is the same as a += 1. So all told, this loop's body is going to sum up all the deaths from every row in the CSV as we loop through it.
Lastly, print(f"The total deaths were {total_deaths}") should look mostly familiar. We're printing some stuff out, but this time, rather than printing out some predefined text, we're going to format a String to substitute a value from the variable total_deaths into where we reference the variable. The instruction to the computer to do this substitution, known as interpolation, comes from the f at the start of the string's double quotes. The funky looking {} is how the computer knows to stop using the explicit text you've written so far, and to swap to trying to replace your placeholder with the value of the variable you referenced.
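Here's a tiny sketch that isolates the three ideas from the last two explanations: gluing strings versus adding numbers, the += shorthand, and the f-string. The values "11" and "70" are two death counts from the sample rows above; the variable names are made up for this example.

```python
a = "11"
b = "70"

glued = a + b            # strings get stuck together end to end
added = int(a) + int(b)  # converting first gives us real arithmetic

total = 0
total += int(a)  # same as: total = total + int(a)
total += int(b)

message = f"The total deaths were {total}"  # {total} is swapped for the variable's value

print(glued)
print(added)
print(message)
```

The first print shows 1170 and the second shows 81, which is the whole reason we bother with int.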

So let's run the code!

Traceback (most recent call last):
  File "C:\Users\..\lesson.py", line 11, in <module>
    total_deaths += int(row['Deaths'])
ValueError: invalid literal for int() with base 10: '.'

Our first error!

“ I see this is the common PEBKAC problem ” — Henry VIII

Now, since you're new to this, this might look bizarre and weird and unhelpful. What the hell does it mean "ValueError"? What's a base 10? Why did it put a '.' here? Traceback? Huh? Module??? What?!

Never fear. The computer is giving you some useful information. First off, it mentioned a line number. In this case, line 11, which happens to be the line that is summing up our total death column value. Helpfully, the computer also supplied the line in question too. Basic errors will often be on that same line, but logical errors in your program can often manifest elsewhere. So it's best to always think about what the error is telling you and what it means in the greater context of your problem rather than necessarily hopping directly to the line the computer has pointed out and staring at it.

That said. The ValueError is telling us that the call we did to int is failing. There's something about an "invalid literal". The "literal" here is just talking about the string we asked the method to turn into a number, and the error message has helpfully wrapped this literal in single quotes and placed it at the end of our error message after the colon. ... with base 10: '.'.
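If you want to convince yourself that the dot really is the culprit, you can reproduce the error in isolation without the file at all. Fair warning: this sketch uses try and except, which the lesson hasn't covered. Treat it as a way to catch the error and look at its message rather than letting the program crash, nothing more.

```python
# Ask int to convert a dot, the same thing our loop tried to do.
try:
    int(".")
except ValueError as error:
    # We land here because int couldn't make a number out of "."
    message = str(error)

print(message)
```

The printed message matches the last line of the traceback, minus the ValueError: prefix.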

There's not a problem with our code, rather, there's a problem with our assumptions about our data. If you open your dataset again and look at it. You'll find that the line with Atherosclerosis in it has a . in the column for deaths.

Well a dot is not a number. And so it shouldn't surprise us that the computer exploded and yelled at us. So, we need to learn about how to deal with this. In simple English, you'd probably say something along the lines of: if there's a dot, then don't try to turn it into a number, if there is a number though, do the thing!

Luckily for us, computers also understand basic conditional logic. So, we can add an "if statement" to our program and make it skip over the rows we don't care about:

import csv

filename = "New_York_City_Leading_Causes_of_Death.csv"
with open(filename) as file_handle:
    reader = csv.DictReader(file_handle)
    total_deaths = 0
    for row in reader:
        if row['Deaths'] == ".":
            continue
        total_deaths += int(row['Deaths'])
    print(f"The total deaths were {total_deaths}")

We've added two lines in here. The first, if row['Deaths'] == ".":, is the aforementioned if statement. When you compare values, you do so with Boolean expressions. Basically, you can check if things are equal to each other (what we're doing), if they're bigger, smaller, or a host of other things. In our case, we're just checking if the string data inside of the row is the same as our explicit dot.

If this is the case, then the condition we just wrote evaluates to True and so we run the block of code that's indented underneath the if statement. Which is the single word continue. This keyword tells our program that we should skip ahead to the next iteration of the loop. Which is fancy talk for "read the next row".
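To watch continue do its thing on a smaller scale, here's a sketch that uses a hard coded list in place of the file. The four values are the Deaths column from the sample rows shown earlier, and the skipped counter is something I added just so you can see how many times we jumped ahead.

```python
# The Deaths column from the four sample rows, as strings, dot included.
deaths_column = ["11", "70", "213", "."]

total = 0
skipped = 0
for value in deaths_column:
    if value == ".":
        skipped += 1
        continue  # jump straight to the next value, the line below never runs for a dot
    total += int(value)

print(total)
print(skipped)
```

Three values get added up and one gets skipped, so the total comes out to 294.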

We could also have written this like this:

import csv

filename = "New_York_City_Leading_Causes_of_Death.csv"
with open(filename) as file_handle:
    reader = csv.DictReader(file_handle)
    total_deaths = 0
    for row in reader:
        if row['Deaths'] == ".":
            total_deaths += 0
        else:
            total_deaths += int(row['Deaths'])
    print(f"The total deaths were {total_deaths}")

If we wanted to be explicit about what happens in the positive case of "Yes, the value in the Deaths column is a dot" and be explicit about what happens in the "else" case of that not being true (we parse the data and convert it). Since adding 0 and skipping ahead to the next line in the file is effectively the same from the perspective of the changes to the total_deaths variable, I think it's nicer to just use the continue option here.

Assuming there's no other surprises in the data, then we can try to run the code again and:

$ python lesson.py
The total deaths were 424998

Success. We now have a total for how many people died since 2007 in New York according to this data set.

But I want more.

“ A script runs on its own, an application interacts ” — Bob Saget ²

So we now have the ability to sum up a column in the data. That's nice and all, but in the "real" world there's a good chance that someone is going to come back to you after you tell them that number they asked for and then say something like, "awesome, thanks so much for that, but can you tell me..." and then list off something else they need from you. There's a lot of discipline around anticipating what a user might want, what they actually want, and what you're going to build as a programmer. Fortunately for you, since this is a blog post, all the questions are known up front!

So let's talk requirements on what sort of stuff our fictional user is going to ask us, and then figure out how to handle these scenarios one by one in a way that makes our life easier.

Our headers in the file tell us that our data is broken down by "Leading Cause", "Sex", and "Race Ethnicity" and our fictional head of department wants to know about the deaths in relation to these three columns. They're going to ask us questions like:

How many people died in 2012?
No actually, I meant 2014?
Could you give me both of those years actually?
Oh wait, no, can you break that down by sex?
How many people provided their race in this data?
Hey, what's the most prevalent cause of death by race in 2009?

And we need to answer them, and be prepared to answer any other question too because our user is probably going to change their mind. So let's talk about the kneejerk response to being asked the first question.

Our program already loops over every row in the CSV and sums things up, if we just want to know the people who died in a specific year, we could use one of those fancy if statements couldn't we? Yes! We could! That'd look like this:

import csv

filename = "New_York_City_Leading_Causes_of_Death.csv"
with open(filename) as file_handle:
    reader = csv.DictReader(file_handle)
    total_deaths = 0
    for row in reader:
        if row['Deaths'] == ".":
            continue
        if row['Year'] != '2012':
            continue
        total_deaths += int(row['Deaths'])
    print(f"The total deaths in 2012 were {total_deaths}")

And running that?

The total deaths in 2012 were 52420

So far so good... oh wait, she wanted the deaths for 2014? Well, if we just replace the two places that mention 2012 in our code with that then...

The total deaths in 2014 were 53006

Oh no, wait, she wanted both... You can see how you could spend your day tweaking, running, and then tweaking some more each time the ask from someone changes. We, as programmers, avoid this nonsense by two methods.

First, we move code that only varies by some factor into reusable chunks of code that we can pull out of our toolbelt whenever we want. Second, we make the user do it instead by tossing the whole thing somewhere they can run it and then we just take requests to extend the functionality of the program. Let's do both.

Making our own toolkit

“ UUoooarrrrraaaaaghhhhhh??? ” — Tim, the Tool Man, Taylor

We've used the open, int, and print functions so far. But those were all built in and provided to us by the standard library included in Python. When we want to make our own methods, we can define those with the def keyword. Now, I want you to think about what we just did to our code and how we ran it twice.

Between each run of the code, we changed the year, and that was it.

Ok, so since that's the only thing that's varying, let's define our entire code so far as the function that we'll call, and make the input argument (remember we mentioned those in the explanations above) be the year:

import csv

def deaths_per_year(year):
    filename = "New_York_City_Leading_Causes_of_Death.csv"
    with open(filename) as file_handle:
        reader = csv.DictReader(file_handle)
        total_deaths = 0
        for row in reader:
            if row['Deaths'] == ".":
                continue
            if row['Year'] != year:
                continue
            total_deaths += int(row['Deaths'])
        print(f"The total deaths in {year} were {total_deaths}")

deaths_per_year('2014')
deaths_per_year('2012')

Well, that certainly works. But I want you to stop and think for a moment. If I asked you to do this with a spreadsheet, would you:

Open the spreadsheet
Count the deaths in the rows with 2014 in them
Close the spreadsheet
Open the spreadsheet again
Count the deaths in the rows with 2012 in them
Close the spreadsheet

If you said yes, then get out, you're not programmer material, hell you're not even data entry material. Go, get, off with you. Go get a job that doesn't need you to do things efficiently, maybe the DMV?

Okay in all seriousness, we wouldn't do what I just described above, so why are we wasting the computer's time doing it? Just like you, there are things computers can do quickly, and there are things computers do slowly. This dataset isn't particularly big so it's not a huge slow down, but if you were working with huge files and lots of data, then one of the very first things you could do is to load the file once.

Why? Because files are stored on your hard drive. Grabbing a file from the disk, versus working with something already in memory, is similar to the difference between you grabbing a drink from your fridge versus driving to a grocery store, picking and choosing your favorite flavor, waiting in line, then driving back home, carrying your groceries inside, putting them away in the fridge, and then grabbing one out and finally drinking it.

Working with files is slow for your computer. So then, let's use our cool new ability to define functions and get the data once:

import csv

def load_file_to_rows():
    filename = "New_York_City_Leading_Causes_of_Death.csv"
    output = []
    with open(filename) as file_handle:
        reader = csv.DictReader(file_handle)
        for row in reader:
            output.append(row)
    return output

def deaths_per_year(rows, year):
    total_deaths = 0
    for row in rows:
        if row['Deaths'] == ".":
            continue
        if row['Year'] != year:
            continue
        total_deaths += int(row['Deaths'])
    print(f"The total deaths in {year} were {total_deaths}")

rows = load_file_to_rows()
deaths_per_year(rows, '2014')
deaths_per_year(rows, '2012')

This should all feel mostly familiar now. Go ahead and type it out and run the code. You'll see that the output is the exact same thing as before. You probably won't notice any speed up either since our dataset isn't huge, but if you're going to learn to program with an eye toward making a career of it, then you should learn about what sort of best practices exist and when to use them.

There's a little bit of new syntax here as well. Can you spot it? Take a second to do that, then read the explanations below:

for row in rows this isn't actually new syntax at all. This is the same sort of thing you saw with the reader from before. The reason the only thing that changes is the name is because both rows and reader are iterable. Which just means that you can pull out each item from the variable's contents one at a time with the for thing in iterable syntax.
What is that rows thing though? output = [] defines a list. Much like when I listed off the headers in the examples above, we use square brackets to define a list's start and stop. In this case, we want to start with an empty list and then add things to it.
We add to the list by calling output.append(row). Much like when we called file_handle.readline(), we're working with a single instance of a list, and we're calling methods on our list variable in order to modify the data inside of it.

I want to highlight now what I brushed over before about classes. You see, a list and a number like 1 have a key difference. A number is a primitive type that's part of the language itself. It is what it is, and when you have 1 you have the value 1 and nothing else. But, when you have a list, that's something that contains both data and behavior. We can ask the list to sort itself, clear out all the stuff inside of it, or as you saw in the above code, add more data to it.

This sort of thing, where you can think about having some sort of wrapper around data that exposes useful methods to manipulate it is what makes up the fundamentals of a class. You define those functions that operate on that data once. And every instance of a class can use them. But every instance of a class might have different data inside.

a = []
b = [1,2,3]
c = b.copy()

The three lists above are all separate from each other, but you can call methods like .append() or .sort() on any of them and each one will change only the data inside of itself, independently of the others. This is why I said before that the common analogy is that a class is a cookie cutter. You have a shape and you can stamp out a bunch of individual cookies. They're all the same type of cookie (Class), but they're each their own unique little thing.
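Here's a quick sketch you can run to check that the lists really are independent. I've started b off unsorted (a small change from the snippet above) so that sorting the copy has a visible effect, and the 42 is just an arbitrary value to append.

```python
a = []
b = [3, 1, 2]
c = b.copy()  # c starts with the same contents as b, but it's its own list

a.append(42)  # only a changes
c.sort()      # only c changes, b keeps its original order

print(a)
print(b)
print(c)
```

Same methods, same type of thing, three different sets of data. That's the cookie cutter idea in action.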

Anyway, classes are one of those things that can take a while to "click" for people. So if it doesn't quite make sense yet, don't worry. It will. There's always more examples, explanations, and tutorials to read until it does. For now, let's keep up our immersion and circle back around to the asks:

How many people died in 2012?
No actually, I meant 2014?
Could you give me both of those years actually?
Oh wait, no, can you break that down by sex?
How many people provided their race in this data?
Hey, what's the most prevalent cause of death by race in 2009?

Right. We want the deaths for both of those years. The text is a bit ambiguous though. Does our fictitious user want the two years separately? Or combined? Why don't we do both? If we want to be able to count deaths for more than one year, then we could pass more arguments into the methods...

def deaths_per_year(rows, year1, year2)

But what would you pass for the second year when you're getting counts for just one year? An empty string? A bogus year? How would this affect the if statement we wrote?

if row['Year'] == year1 or row['Year'] == year2:

Well, we'd learn some new boolean conditional keywords this way with or, and, and not, but imagine the headache of updating this code whenever someone asks you to check for one more year. It'd be a pain, and it raises a lot of questions about what our program will end up looking like (hint: not good). So let's instead use that class we just learned about. What if we pass a list of years? Well, we'd handle the case where we want to know about just one year by only passing in a list with one thing in it. For multiple years, just a list with multiple items in it... this sounds good. What about the if condition though?

def deaths_per_years(rows, years):
    total_deaths = 0
    for row in rows:
        if row['Deaths'] == ".":
            continue
        if row['Year'] in years:
            total_deaths += int(row['Deaths'])
    print(f"The total deaths in {years} were {total_deaths}")

rows = load_file_to_rows()
deaths_per_years(rows, ['2014'])
deaths_per_years(rows, ['2012'])
deaths_per_years(rows, ['2012', '2014'])

The in operator is pretty powerful. We can use it not just to express the looping behavior we've been using, but also to tell the computer to check if anything inside of a list matches the value we've mentioned on the left side of the in. When we run this, you can see that the way the text is being formatted looks a bit funny though:

The total deaths in ['2014'] were 53006
The total deaths in ['2012'] were 52420
The total deaths in ['2012', '2014'] were 105426

This is the default way python prints out a list. If we don't like that though, we can take advantage of some built-in string methods (as noted in the documentation) to format it a bit nicer:

joined_string = ', '.join(years)
print(f"The total deaths in {joined_string} were {total_deaths}")

Which will print out:

The total deaths in 2014 were 53006
The total deaths in 2012 were 52420
The total deaths in 2012, 2014 were 105426
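
If you want to poke at in and join on their own before moving on, here's a small sketch. The lists are made up for illustration and have nothing to do with the CSV. One thing worth seeing for yourself: join only works on strings, so a list of numbers has to be converted first.

```python
# Made-up values purely for illustration; not taken from the dataset.
years = ['2012', '2014']

# `in` asks: is the value on the left equal to any item in the list?
print('2012' in years)   # True
print('2013' in years)   # False
print(2012 in years)     # False: the number 2012 is not the string '2012'

# join glues a list of strings together, putting the separator in between.
label = ', '.join(years)
print(label)             # 2012, 2014

# join only accepts strings, so numbers need converting with str() first.
numeric_years = [2012, 2014]
as_strings = []
for year in numeric_years:
    as_strings.append(str(year))
numeric_label = ', '.join(as_strings)
print(numeric_label)     # 2012, 2014
```

This is also why the years we pass in are quoted: the CSV reader hands us every value as a string, so '2012' matches and 2012 would not.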

Nice. Let's move onto the next requirement we've got.

Oh wait, no, can you break that down by sex?

Assuming that your boss doesn't want you to get freaky on the dance floor, we're going to have to massage the data to filter the rows we care about when doing that count. Rather than get overly clever with it, let's do the basic thing first. Copy, paste, rename.

def male_deaths_per_years(rows, years):
    total_deaths = 0
    for row in rows:
        if row['Deaths'] == ".":
            continue
        if row['Sex'] == 'F':
            continue
        if row['Year'] in years:
            total_deaths += int(row['Deaths'])

    print(f"The total male deaths in {', '.join(years)} were {total_deaths}")

def female_deaths_per_years(rows, years):
    total_deaths = 0
    for row in rows:
        if row['Deaths'] == ".":
            continue
        if row['Sex'] == 'M':
            continue
        if row['Year'] in years:
            total_deaths += int(row['Deaths'])
    print(f"The total female deaths in {', '.join(years)} were {total_deaths}")

As a good general rule, it's nice to do the easy thing first, then abstract. When you're young and getting started with programming, it's best to keep things simple and do this. When you're old and have seen enough requirements to anticipate what's needed next, then you can decide if you want to leverage doing things the more "clever" way first.

That said, every line of code you write is a line of code you have to maintain. So copying and pasting a method, then adding 2 lines to each, should set off some alarm bells in your head. Hopefully the alarm that's going off is telling you this is just like when we needed to collect data by year!

So, let's ditch these hyper specific methods and instead add a new input argument to our old reliable. But this time, let's do something slightly different. We know that for this 4th requirement they want the deaths broken down by sex, but for the next two, they don't care about the sex at all (much like a strangely large portion of the dating pool). So, let's make this next input optional. Here's how we do that:

def deaths_per_years(rows, years, sex = None):
    total_deaths = 0
    for row in rows:
        if row['Deaths'] == ".":
            continue
        if sex is not None and row['Sex'] != sex:
            continue
        if row['Year'] in years:
            total_deaths += int(row['Deaths'])

    output = f"The total deaths in {', '.join(years)}"
    if sex is not None:
        output += f" for sex {sex}"
    output += f" were {total_deaths}"
    print(output)

As per usual, type this out yourself. Read the code as you do, and think a little about the new stuff we just added. Are you building an intuition for what it all means? No? That's okay, it'll come to you in time, and as you get exposed to more and more programming languages, things will feel right at home even when you look at a language you've never seen before ³.

There are two (maybe three-ish) new things here. Let's take them in order from top to bottom:

sex = None Get your giggle out now. Good? Cool, so when we say argument = something in the list of arguments for a function, we're setting a default for it. This means that our existing calls to the deaths_per_years method will continue to work. This is nice. It means that we don't have to go fix existing code (imagine if you called this function from 30 different files!) and that we can continue on to where we're using this potential value.
if sex is not None and row['Sex'] != sex We've already mentioned that and and not were keywords for doing conditional logic. And now you get to see them in practice! Isn't that nice? There's also this new and mysterious is keyword. Which, much like Bill Clinton, you might be wondering what is is. Remember how we talked about classes before? And how we have instances of each? Or, they're cookies that have been cut from the same Class cloth? Well, is checks to see if two cookies are really the same cookie. It allows us to check if a variable is using the same object underneath the hood, not just the same value. If you'd like to see this in action, go ahead and type print([] == [], [] is []) in your program and run it. You'll see that it prints out True False. If you remember that lists are wrappers around their own copies of data like I said before, this makes sense. The empty lists have the same value (nothing), but they are still two different lists.
output += f" for sex {sex}" I mentioned this before, that you can combine strings together with the + operator. Well, much like our running count in our numeric variable, we can use += to append new strings to an existing one. Which is pretty handy because we can choose to include the extra information about sex in the output string only if it's actually being used.
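
If you'd like to play with those three ideas away from the death data, here's a small sketch. The describe function is something I made up for this example, it isn't part of our program. It shows a default argument, the is check, and += on a string all in one place.

```python
# A tiny, made-up function (not part of the post's program) with an optional input.
def describe(year, sex=None):
    output = f"Deaths in {year}"
    # `is not None` tells us whether the caller actually supplied a sex.
    if sex is not None:
        output += f" for sex {sex}"   # += sticks more text on the end
    return output

print(describe('2012'))            # Deaths in 2012
print(describe('2012', sex='F'))   # Deaths in 2012 for sex F

# == compares values, while `is` asks if both names point at the same object.
first = []
second = []
same = first
print(first == second)  # True: both are empty lists
print(first is second)  # False: two separate cookies
print(first is same)    # True: two names for one cookie

same.append('x')
print(first)            # ['x'] because first and same are the same list
```

The last two lines are the part worth staring at: changing the list through one name changes what you see through the other, because there was only ever one list.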

Okay, let's call the code now:

rows = load_file_to_rows()
deaths_per_years(rows, ['2014'])
deaths_per_years(rows, ['2012'])
deaths_per_years(rows, ['2012', '2014'])
deaths_per_years(rows, ['2012'], sex = 'M')
deaths_per_years(rows, ['2012'], sex = 'F')
deaths_per_years(rows, ['2012'], sex = None)

and the result:

The total deaths in 2014 were 53006
The total deaths in 2012 were 52420
The total deaths in 2012, 2014 were 105426
The total deaths in 2012 for sex M were 25654
The total deaths in 2012 for sex F were 26766
The total deaths in 2012 were 52420

I included the last line to show you that our default value of None really does work, since you can see that line 2 and the last line produce the same number of deaths.

We're getting pretty good at this programming thing, don't you think? We can read data, we can filter data, we're even making our lives easier with optional parameters! Let's tackle the next ask from our fictional boss. I'll copy over the list of requirements again:

How many people died in 2012?
No actually, I meant 2014?
Could you give me both of those years actually?
Oh wait, no, can you break that down by sex?
How many people provided their race in this data?
Hey, what's the most prevalent cause of death by race in 2009?

How many people provided their race in this data? In order to figure out how to handle this we kind of need to know what sort of races are even available to us. Is this another no deaths, put a . situation or something? Let's find out with a new function.

But before we code: if I gave you a physical ledger, and asked you to tell me the possible values I could find in this race column, what would you do? In real life there's no button in a dropdown to see unique values after all. I would guess that you'd probably take out a new piece of paper, then start going down line by line on the ledger. Whenever you saw a value, you'd see if it was on your second piece of paper, and if not you'd add it. If it was, you'd continue on.

Turns out, we can get the computer to do this too. And we can do it using the stuff we already learned!

def unique_values_in_column(rows, column_name):
    values = []
    for row in rows:
        if row[column_name] in values:
            continue
        else:
            values.append(row[column_name])
    return values

Nothing in this code is new. Check it out. You've seen function definitions before, and you've seen us iterate over the rows of data before. You've seen if, else, in, and continue. And you've seen return used to return the rows of data from the file for use outside of the function. Let's use this and see what the results are:

races = unique_values_in_column(rows, 'Race Ethnicity')
print(races)

and the print call outputs...

[
    'Other Race/ Ethnicity', 
    'Not Stated/Unknown', 
    'Black Non-Hispanic', 
    'White Non-Hispanic', 
    'Asian and Pacific Islander', 
    'Hispanic'
]
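
As an aside, checking "is this already on my second piece of paper?" against a list gets slower as the list grows. Python has a built-in set type that is made for exactly that question. Here's a sketch of the same idea using one. The function name and the sample rows are made up for illustration, and the version above is perfectly fine for a file this size.

```python
# Made-up rows standing in for what csv.DictReader would hand us.
sample_rows = [
    {'Race Ethnicity': 'Hispanic'},
    {'Race Ethnicity': 'Not Stated/Unknown'},
    {'Race Ethnicity': 'Hispanic'},
]

def unique_values_with_set(rows, column_name):
    seen = set()     # quick "have I seen this before?" lookups
    values = []      # remembers the order we first saw each value in
    for row in rows:
        value = row[column_name]
        if value not in seen:
            seen.add(value)
            values.append(value)
    return values

print(unique_values_with_set(sample_rows, 'Race Ethnicity'))
# ['Hispanic', 'Not Stated/Unknown']
```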

So we know our options now, which is handy. We can use this to figure out how many deaths didn't have a known race. All we have to do is count up the total number of deaths in the rows that have that race in them. Assuming that "this data" means all the data and not just one of the years, then we could write this in a helper function like this:

def count_deaths_where_column_is(rows, column_name, is_this_value):
    count = 0
    for row in rows:
        if row['Deaths'] == ".":
            continue
        if row[column_name] == is_this_value:
            count += int(row['Deaths'])
    return count

where once again, nothing is new, just another application of what we've learned. Calling it tells us that there are 4099 deaths that weren't classified by race.

count = count_deaths_where_column_is(rows, 'Race Ethnicity', 'Not Stated/Unknown')
print(count) # prints 4099

And that answers the question we're being asked. But... Take a look at this code. Now look at the deaths_per_years code again. Do you notice anything? Compare them line by line, don't they feel sort of similar? They should.

We've got the same death dot check, the same loop, and just like sex, we're checking out the value of a column against an input. Knowing that we'll probably be asked tomorrow about one of the specific races, we can generalize our ideas here and tweak our usual method into being able to handle filtering down by a specific search criteria:

def deaths_per_years(rows, years, sex = None, column_is_value = None):
    total_deaths = 0
    for row in rows:
        if row['Deaths'] == ".":
            continue

        if sex is not None and row['Sex'] != sex:
            continue
        
        if column_is_value is not None:
            if row[column_is_value['column']] != column_is_value['value']:
                continue
        
        if row['Year'] in years:
            total_deaths += int(row['Deaths'])

    output = f"The total deaths in {', '.join(years)}"

    if sex is not None:
        output += f" for sex {sex}"

    if column_is_value is not None:
        c = column_is_value['column']
        v = column_is_value['value']
        output += f" for column {c} = {v}"

    output += f" were {total_deaths}"
    print(output)

column_is_value = {'column': 'Race Ethnicity', 'value': 'Not Stated/Unknown'}
deaths_per_years(rows, ['2012'], None, column_is_value)

This prints off

The total deaths in 2012 for column Race Ethnicity = Not Stated/Unknown were 541

If we didn't pass None as our 3rd input and instead specified a sex then:

The total deaths in 2012 for sex M for column Race Ethnicity = Not Stated/Unknown were 314

The only thing new in this code is that we're using a dictionary as an input. We've been using the dictionary the CSV reader provided us, but this is the first time we declared our own! However, once again, I want you to squint at this code. What do you notice?

Do you happen to notice that the check for Sex is eerily similar to our new generic column check? Sure, there's some extra indenting as we've nested an if statement inside of another one, but if you were to pass in {'column': 'Sex', 'value': 'M'} wouldn't it just be doing the exact same checks?

Yes it would. So, rather than having the column_is_value be singular, let's go ahead and make it a list so that we can provide multiple columns at once to be checked! This will give us a more flexible interface to our program, able to adapt to the shifting demands of "the business", and more importantly, give us less code to maintain.

def deaths_per_years(rows, years, columns_matching_values = None):
    total_deaths = 0
    for row in rows:
        if row['Deaths'] == ".":
            continue
        if columns_matching_values is not None:
            should_skip = False
            for criteria in columns_matching_values:
                if row[criteria['column']] not in criteria['values']:
                    should_skip = True
            if should_skip:
                continue

        if row['Year'] in years:
            total_deaths += int(row['Deaths'])

    output = f"The total deaths in {', '.join(years)}"
    if columns_matching_values is not None:
        for criteria in columns_matching_values:
            c = criteria['column']
            v = ', '.join(criteria['values'])
            output += f" for column {c} = {v}"
    output += f" were {total_deaths}"
    print(output)

There's only one new thing I want to call attention to here. should_skip is used to set up what's commonly called a sentinel value. You might also hear it called a flag. If you remember from the explanation of continue, it skips ahead to the next item in whatever we're looping over. But, when we're looping across the possible columns we're filtering by, we're inside of a loop inside of a loop. So if we tried to use continue in there, we'd just move on to the next column to check rather than the next row. Which would be pointless.

So instead, we setup a little flag for ourselves, do our inner loop, and then check to see if anyone killed that canary in this particular coal mine. If we found an example of the row not including a value we're looking for, then we know we shouldn't count the deaths in the row.
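
Here's the flag idea on its own so you can see it work on a single row. The row, the two criteria lists, and the row_should_be_skipped name are all made up for this sketch. It's the same inner loop as above, just pulled out into its own little function.

```python
# A made-up row and made-up criteria, shaped like the ones in our function.
row = {'Year': '2012', 'Sex': 'F', 'Deaths': '10'}

year_and_male = [
    {'column': 'Year', 'values': ['2012', '2014']},
    {'column': 'Sex', 'values': ['M']},
]
year_only = [
    {'column': 'Year', 'values': ['2012', '2014']},
]

def row_should_be_skipped(row, criterias):
    should_skip = False               # the flag starts lowered
    for criteria in criterias:
        if row[criteria['column']] not in criteria['values']:
            should_skip = True        # raise the flag, keep checking the rest
    return should_skip

print(row_should_be_skipped(row, year_and_male))  # True: the row is F, we wanted M
print(row_should_be_skipped(row, year_only))      # False: the year matches
```

Notice the flag only ever goes from False to True. Once a single criteria fails, nothing later in the inner loop can un-fail it.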

The other thing that has changed significantly here is the function signature. You may remember that all I mean by that is the inputs to the function. No more are we sexless! Erm. I mean, we've removed sex = None from the arguments. And instead, replaced it with columns_matching_values = None. This means we've got to update our inputs. The price we pay for a more powerful and flexible way of querying the data is a slightly more complicated input:

rows = load_file_to_rows()
deaths_per_years(rows, ['2014'])
deaths_per_years(rows, ['2012'])
deaths_per_years(rows, ['2012', '2014'])
criterias = [
    {'column': 'Sex', 'values': ['M']}
]
deaths_per_years(rows, ['2012'], criterias)
criterias = [
    {'column': 'Sex', 'values': ['F']}
]
deaths_per_years(rows, ['2012'], criterias)
criterias = [
    {'column': 'Sex', 'values': ['M']}
]
deaths_per_years(rows, ['2012'], columns_matching_values = None )
count = count_deaths_where_column_is(rows, 'Race Ethnicity', 'Not Stated/Unknown')
print(count)
criterias = [
    {'column': 'Sex', 'values': ['M'] },
    {'column': 'Race Ethnicity', 'values': ['Not Stated/Unknown'] }
]
deaths_per_years(rows, ['2012'], criterias)

and the usual prints out:

The total deaths in 2014 were 53006
The total deaths in 2012 were 52420
The total deaths in 2012, 2014 were 105426
The total deaths in 2012 for column Sex = M were 25654
The total deaths in 2012 for column Sex = F were 26766
The total deaths in 2012 were 52420
4099
The total deaths in 2012 for column Sex = M for column Race Ethnicity = Not Stated/Unknown were 314

But we're not done yet! Since we're taking in multiple values for the possible column values, we can ask ourselves the same question we did before when we fiddled with the Sex column! And by jove, the year check itself is just another column we could be checking with our new criteria model! All we have to do is delete the condition checking the years and move the actual summation of the Deaths out one indentation level so that we do it if the criteria didn't skip the row:

def deaths_per_years(rows, columns_matching_values = None):
    total_deaths = 0
    for row in rows:
        if row['Deaths'] == ".":
            continue
        if columns_matching_values is not None:
            should_skip = False
            for criteria in columns_matching_values:
                if row[criteria['column']] not in criteria['values']:
                    should_skip = True
            if should_skip:
                continue

        total_deaths += int(row['Deaths'])

    output = f"The total deaths"
    if columns_matching_values is not None:
        for criteria in columns_matching_values:
            c = criteria['column']
            v = ', '.join(criteria['values'])
            output += f" in column {c} = {v}"
    output += f" were {total_deaths}"
    print(output)

And the output is a little rougher around the edges, but still understandable for anyone who might be querying this dataset. I've tweaked the inputs a bit to give you some samples of how you can update a list in place, as well as the values of the dictionaries themselves:

rows = load_file_to_rows()

# Explicitly set the map like so:
criterias = [
    {'column': 'Year', 'values': ['2014']}
]
deaths_per_years(rows, criterias)

# A bit brittle, but okay when localized:
criterias[0]['values'] = ['2012']
deaths_per_years(rows, criterias)
criterias[0]['values'] = ['2012', '2014']
deaths_per_years(rows, criterias)

# add new criteria to the existing setup with append:
criterias[0]['values'] = ['2012']
criterias.append( {'column': 'Sex', 'values': ['M']} )
deaths_per_years(rows, criterias)

# change M to F for the second criteria we're searching by:
criterias[1]['values'] = ['F']
deaths_per_years(rows, criterias)

# Or just set it to a brand new list of dictionaries:
criterias = [
    {'column': 'Year', 'values': ['2012']},
    {'column': 'Sex', 'values': ['M']}
]
# Passing None ignores the criteria entirely and counts every row:
deaths_per_years(rows, columns_matching_values = None )

# You can see how this is pretty powerful:
criterias = [
    {'column': 'Year', 'values': ['2012']},
    {'column': 'Sex', 'values': ['M'] },
    {'column': 'Race Ethnicity', 'values': ['Not Stated/Unknown'] }
]
deaths_per_years(rows, criterias)

And this still prints out the usual numbers, which helpfully tells us we haven't broken anything yet!

The total deaths in column Year = 2014 were 53006
The total deaths in column Year = 2012 were 52420
The total deaths in column Year = 2012, 2014 were 105426
The total deaths in column Year = 2012 in column Sex = M were 25654
The total deaths in column Year = 2012 in column Sex = F were 26766
The total deaths were 424998
The total deaths in column Year = 2012 in column Sex = M in column Race Ethnicity = Not Stated/Unknown were 314
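
Since I called the in-place updates "a bit brittle", here's a small sketch of why. Nothing here touches the CSV, and the alias and fresh names are made up for the example. The point is that a second name for the same list sees every change you make, while a brand new list doesn't.

```python
criterias = [{'column': 'Year', 'values': ['2014']}]

# Replace the values of the first dictionary in the list.
criterias[0]['values'] = ['2012']

# Add a second dictionary to the end of the same list.
criterias.append({'column': 'Sex', 'values': ['M']})

# A second name for the same list sees every change...
alias = criterias
alias[1]['values'] = ['F']
print(criterias[1]['values'])       # ['F']

# ...while a brand new list leaves the old one alone.
fresh = [{'column': 'Year', 'values': ['2009']}]
print(len(criterias), len(fresh))   # 2 1
```

When the list is created and used in the same few lines, this is easy to keep track of. When it's passed around a big program, it gets a lot harder to know who changed what.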

Now this is fun, but let's see if we can use it to answer the final question we've got:

Hey, what's the most prevalent cause of death by race in 2009?

Last time we did a copy paste tweak strategy. This time let's take a step back and think about what we've learned so far and try to see if we can make our job easier. We now have a way to filter the rows we want to count deaths for, and then the function we're calling prints out the information about it. We also made a method for ourselves to find out what unique values were available to us for a particular column. We've seen, via the inputs to the newest version of the function, that we can change dictionary values after they've been defined pretty easily. So, let's circle back to reality again.

If you need to know the most prevalent cause of death, then you need to know the counts for each type of death. Once you have that, you just need to get the biggest one. Well... we know how to get part of this information:

causes_of_death = unique_values_in_column(rows, 'Leading Cause')
print(causes_of_death)

Will print out a ton of stuff:

['Diabetes Mellitus (E10-E14)', 'Diseases of Heart (I00-I09, I11, I13, I20-I51)', 'Cerebrovascular Disease (Stroke: I60-I69)', 'Atherosclerosis (I70)', 'Malignant Neoplasms (Cancer: C00-C97)', 'Chronic Lower Respiratory Diseases (J40-J47)', 'Intentional Self-Harm (Suicide: X60-X84, Y87.0)', 'All Other Causes', 'Septicemia (A40-A41)', 'Chronic Liver Disease and Cirrhosis (K70, K73)', 'Influenza (Flu) and Pneumonia (J09-J18)', 'Accidents Except Drug Posioning (V01-X39, X43, X45-X59, Y85-Y86)', 'Mental and Behavioral Disorders due to Accidental Poisoning and Other Psychoactive Substance Use (F11-F16, F18-F19, X40-X42, X44)', 'Essential Hypertension and Renal Diseases (I10, I12)', "Alzheimer's Disease (G30)", 'Human Immunodeficiency Virus Disease (HIV: B20-B24)', 'Assault (Homicide: Y87.1, X85-Y09)', 'Congenital Malformations, Deformations, and Chromosomal Abnormalities (Q00-Q99)', 'Nephritis, Nephrotic Syndrome and Nephrisis (N00-N07, N17-N19, N25-N27)', 'Insitu or Benign / Uncertain Neoplasms (D00-D48)', 'Certain Conditions originating in the Perinatal Period (P00-P96)', 'Viral Hepatitis (B15-B19)', 'Mental and Behavioral Disorders due to Use of Alcohol (F10)', 'Tuberculosis (A16-A19)', 'Aortic Aneurysm and Dissection (I71)', "Parkinson's Disease (G20)"]

Now. I want you to think again about if you were doing this by hand with a piece of paper. You'd write each cause down as you encountered it, then maybe have a tally next to each cause that you'd add to each time you saw it show up in the data. That sounds an awful lot like a key to value relationship, doesn't it? We've learned a useful way to track that kind of data, so why don't we do something like this:

death_by_count = {}
for cause in causes_of_death:
    death_by_count[cause] = 0

The {} is an empty dictionary. And since we have a list of causes, we can loop over them in the usual way to get each one and initialize the count to 0. Once we've got that, we just need a count! We already have a querying method, so if we just re-use that then we'll get something sort of close:

causes_of_death = unique_values_in_column(rows, 'Leading Cause')
death_by_count = {}
for cause in causes_of_death:
    death_by_count[cause] = 0
    criterias = [
        {'column': 'Leading Cause', 'values': [cause]},
        {'column': 'Year', 'values': ['2009']}
    ]
    deaths_per_years(rows, criterias)

But running this is going to print out a whole bunch of stuff, including things like:

The total deaths in column Leading Cause = Parkinson's Disease (G20) in column Year = 2009 were 0

Which is very hard to read next to a bunch of other lines of text. If we want to filter out the stuff that doesn't matter, and more importantly, only display the leading cause of death for that year, then we need to separate the way we query and get our data from the way we print our data. Again, we're not doing anything new here. So if you want to give this a shot yourself, please do! Simply separate the print related lines from the filtering related lines and update one of the methods to actually return something.

def deaths_per_years(rows, columns_matching_values = None):
    total_deaths = 0
    for row in rows:
        if row['Deaths'] == ".":
            continue
        if columns_matching_values is not None:
            should_skip = False
            for criteria in columns_matching_values:
                if row[criteria['column']] not in criteria['values']:
                    should_skip = True
            if should_skip:
                continue

        total_deaths += int(row['Deaths'])
    return total_deaths

def print_deaths(total_deaths, columns_matching_values = None):
    output = f"The total deaths"
    if columns_matching_values is not None:
        for criteria in columns_matching_values:
            c = criteria['column']
            v = ', '.join(criteria['values'])
            output += f" in column {c} = {v}"
    output += f" were {total_deaths}"
    print(output)

Of course, our calling code needs to change again if we want the same output as before. It becomes variations of this:

print_deaths(deaths_per_years(rows, criterias), criterias)

But more importantly, with a method to get the counts, we can finish answering the last question:

causes_of_death = unique_values_in_column(rows, 'Leading Cause')
largest_death_count = 0
largest_cause = '?'
for cause in causes_of_death:
    criterias = [
        {'column': 'Leading Cause', 'values': [cause]},
        {'column': 'Year', 'values': ['2009']}
    ]
    death_count = deaths_per_years(rows, criterias)
    if death_count > largest_death_count:
        largest_cause = cause
        largest_death_count = death_count

print(f"The largest cause of death in 2009 was {largest_cause} with {largest_death_count}")

Unsurprisingly, given how much good food there is in NYC, the leading cause of death was:

The largest cause of death in 2009 was Diseases of Heart (I00-I09, I11, I13, I20-I51) with 20084
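
Remember the tally-on-paper idea with the dictionary? We ended up not needing it, but here's a sketch of how it could look if you wanted to go over the rows just once instead of once per cause. The rows and the numbers below are invented so the example runs on its own, and tally_deaths_by_cause is a name I made up. Treat it as one possible way to do it, not the way.

```python
# Made-up rows, shaped like the CSV rows but with invented numbers.
sample_rows = [
    {'Year': '2009', 'Leading Cause': 'Cause A', 'Deaths': '5'},
    {'Year': '2009', 'Leading Cause': 'Cause B', 'Deaths': '7'},
    {'Year': '2009', 'Leading Cause': 'Cause A', 'Deaths': '4'},
    {'Year': '2009', 'Leading Cause': 'Cause C', 'Deaths': '.'},
    {'Year': '2010', 'Leading Cause': 'Cause B', 'Deaths': '50'},
]

def tally_deaths_by_cause(rows, year):
    death_by_count = {}
    for row in rows:
        # Same dot check as always, plus skipping the years we don't want.
        if row['Deaths'] == "." or row['Year'] != year:
            continue
        cause = row['Leading Cause']
        if cause not in death_by_count:
            death_by_count[cause] = 0      # first time seeing this cause
        death_by_count[cause] += int(row['Deaths'])
    return death_by_count

tally = tally_deaths_by_cause(sample_rows, '2009')
print(tally)  # {'Cause A': 9, 'Cause B': 7}

# Looping over a dictionary gives us its keys, one at a time.
largest_cause = '?'
largest_death_count = 0
for cause in tally:
    if tally[cause] > largest_death_count:
        largest_cause = cause
        largest_death_count = tally[cause]
print(largest_cause, largest_death_count)  # Cause A 9
```

The same shape would work for a "by race" breakdown too: run the tally once per race, with one more check on the race column.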

And that's all the questions answered! But you know what's better than having a handy script that can let you answer people's questions in a near magical way? That's right.

Make them do it themselves

“ Done right, you want? Do self, you will. ” — Anakin Skywalker

So you're immersed in the world of lists, dictionaries, strings, if statements, loops. You've got a pretty decent kit of stuff you can expand on here. But the best thing is always to free your time up for learning by making the user get their own data. Let's assume that this person who keeps pestering you for information is somewhat technical, they've got python installed on their machine and they even know how to use the command line interface a bit! ⁴

Good. That means we can get into dealing with some basic user input. Heck, even if this savvy business person doesn't use the tool, if you make it easier for you to interface with, then that's still a win. Rather than write the function calls ourselves, let's use input to get information from the user and then do stuff with it.

If you enter user_input = input("What's your favorite color? ") and run the program, you'll be greeted by the string you gave to the input function sitting on your screen. And unlike before, your script doesn't stop. It sits there. Waiting for something. Well, it's waiting for you to type something in and press enter. Once you do, the user_input variable is going to contain the text you entered as its value.

You can probably see how this will be useful.

We can compare the value the user enters against a potential range of things, and then call our little querying system based on that. In order to do that, we'll need a different kind of loop though.

criteria = []
while True:
    print(f"Current criteria: {criteria}")
    user_input = input("Reset criteria? [Y/N] ")
    if user_input not in ['Y', 'N', 'y', 'n']:
        print("You must enter Y or N")
        continue

    if user_input in ['Y', 'y']:
        criteria = []

    year_input = input("Would you like to filter by a year? [Y/N] ")
    if year_input in ['Y', 'y']:
        year = input("Enter the year and press enter: ")
        criteria.append({ 'column': 'Year', 'values': [year] })

    sex_input = input("Would you like to filter by sex? [Y/N] ")
    if sex_input in ['Y', 'y']:
        sex = input("Enter the sex [M/F] to filter by and press enter: ")
        if sex not in ['M', 'F']:
            print("invalid sex input!")
            continue

        criteria.append({ 'column': 'Sex', 'values': [sex] })

    race_input = input("Would you like to filter by race? [Y/N] ")
    if race_input in ['Y', 'y']:
        race = input("Enter the race to filter by and press enter: ")
        if race not in unique_values_in_column(rows, 'Race Ethnicity'):
            values = unique_values_in_column(rows, 'Race Ethnicity')
            print(f"Invalid value! possible values are {values}")
            continue

        criteria.append({ 'column': 'Race Ethnicity', 'values': [race] })


    print("Searching...")

    death_count = deaths_per_years(rows, criteria)
    print_deaths(death_count, criteria)

    are_we_done = input("Would you like to quit? [Y/N]: ")
    if are_we_done in ['Y', 'y']:
        break

There's a couple new things to point out here. And a bunch of improvements I'm leaving to you to do as an exercise. First off, the new stuff:

while True: Congratulations. This is your first "infinite" loop. We will loop forever because the while keyword indicates that we should repeat its block of code until a condition evaluates to False. Given that we've hardcoded the condition to True, you can see it will go on and on and on and on...
some_input not in [...] the variations on checking our user input use a list of values that have a meaning of some kind. Specifically, we ask the user to enter either Y or N and then use the power of if statements to either skip to the next round of the loop, or to carry on playing 20 questions.
break this is another new keyword. Similar to continue this controls the flow of the program when we're inside of a loop. You can probably guess, but if continue continues the next run of the loop, then break probably... that's right. It breaks us out of the loop. In other words, our "infinite" loop isn't actually infinite. It will stop once we hit the break code when we ask the user to quit and they type a Y in.
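
If you want to watch while, continue, and break work together without having to type answers in, here's a small sketch. Instead of calling input, it reads from a list of pretend answers that I made up, so it runs the same way every time.

```python
# Pretend user input so the example runs without anyone typing.
# These answers are made up for illustration.
scripted_answers = ['maybe', 'n', 'Y']

attempts = 0
while True:
    answer = scripted_answers[attempts]
    attempts += 1

    if answer not in ['Y', 'N', 'y', 'n']:
        print(f"'{answer}' is not Y or N, asking again")
        continue   # jump back to the top of the while loop

    if answer in ['Y', 'y']:
        print("Quitting")
        break      # leave the loop entirely

    print("Not quitting yet")

print(f"Loop ended after {attempts} answers")  # Loop ended after 3 answers
```

The first answer trips the continue, the second falls all the way through to the bottom of the loop, and the third hits the break.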

And that's it. If we run the code and say no to everything we'll get the total deaths:

Current criteria: []
Reset criteria? [Y/N] n
Would you like to filter by a year? [Y/N] n
Would you like to filter by sex? [Y/N] n
Would you like to filter by race? [Y/N] n
Searching...
The total deaths were 424998
Would you like to quit? [Y/N]: Y

And if we actually do some filtering?

Current criteria: []
Reset criteria? [Y/N] n
Would you like to filter by a year? [Y/N] y
Enter the year and press enter: 2012
Would you like to filter by sex? [Y/N] y
Enter the sex [M/F] to filter by and press enter: F
Would you like to filter by race? [Y/N] n
Searching...
The total deaths in column Year = 2012 in column Sex = F were 26766
Would you like to quit? [Y/N]: y

And unsurprisingly, since it's using the same code to query as before, we get the same count of results! Lovely.

What's not lovely of course is that I've left a number of bugs in here for you to have fun with and try to puzzle out. You should have everything you need to know in order to figure out how to fix them, so read the post again, or do some google fu, and struggle, toil, and figure it out. I would not recommend using an AI to figure it out. An AI will rob you of the way information that's been found after a struggle tends to stick in your head. Very useful for quick stuff you already know how to do. Great even for explaining concepts you're fuzzy on. But useless when it comes to making sure you know how to actually reason and understand your own programs.

Exercise for the reader

“ Exercises for the reader are my way of saying this blog post is too long and I want to do other things ” — Me

So here's some bugs I know are in the last bit of code I just showed you, do your best to figure out how to fix them and you'll be on your way to success!

We repeatedly ask the user to say yes or no, can you figure out a way to not have to write that so many times?
Update the possible accepted inputs to include Yes and No spelled out.
Can you look up string functions in the python documentation to figure out a way to ignore the case of the letters during that input?
On the first pass through the loop, can you ignore the question to reset the criteria?
If you answer no to quitting and enter a criteria you've already entered before, does the search work?
Can you extend the program to query for other columns based on user input?
Can you extend the program to not only give death counts, but ask if the user wants to run some of the other functions?
Can you make it possible to answer all the business questions through the interactive input?
Can you validate the user input without skipping ahead to the next cycle of the loop?

If you can knock out a few of these, especially the major bug you can replicate with the 5th exercise, you should be on your way to success. I hope that this was somewhat interesting to anyone new to programming, and that it's more encouraging than discouraging. For experienced programmers, perhaps you disagree with my path to the "solution", and perhaps you'd have done it a different way. Your exercise is to go write a blog post yourself about it so I can read it and learn something new.

Have fun!

The full code, for reference

import csv

def load_file_to_rows():
    filename = "New_York_City_Leading_Causes_of_Death.csv"
    output = []
    with open(filename) as file_handle:
        reader = csv.DictReader(file_handle)
        for row in reader:
            output.append(row)
    return output

def unique_values_in_column(rows, column_name):
    values = []
    for row in rows:
        if row[column_name] in values:
            continue
        else:
            values.append(row[column_name])
    return values

def count_deaths_where_column_is(rows, column_name, is_this_value):
    count = 0
    for row in rows:
        if row['Deaths'] == ".":
            continue
        if row[column_name] == is_this_value:
            count += int(row['Deaths'])
    return count

def deaths_per_years(rows, columns_matching_values = None):
    total_deaths = 0
    for row in rows:
        if row['Deaths'] == ".":
            continue
        if columns_matching_values is not None:
            should_skip = False
            for criteria in columns_matching_values:
                if row[criteria['column']] not in criteria['values']:
                    should_skip = True
            if should_skip:
                continue

        total_deaths += int(row['Deaths'])
    return total_deaths

def print_deaths(total_deaths, columns_matching_values = None):
    output = "The total deaths"
    if columns_matching_values is not None:
        for criteria in columns_matching_values:
            c = criteria['column']
            v = ', '.join(criteria['values'])
            output += f" in column {c} = {v}"
    output += f" were {total_deaths}"
    print(output)


rows = load_file_to_rows()
criterias = [
    {'column': 'Year', 'values': ['2014']}
]
print_deaths(deaths_per_years(rows, criterias), criterias)
criterias = [
    {'column': 'Year', 'values': ['2012']}
]
print_deaths(deaths_per_years(rows, criterias), criterias)
criterias = [
    {'column': 'Year', 'values': ['2012', '2014']}
]
print_deaths(deaths_per_years(rows, criterias), criterias)
criterias = [
    {'column': 'Sex', 'values': ['M']},
    {'column': 'Year', 'values': ['2012']}
]
print_deaths(deaths_per_years(rows, criterias), criterias)
criterias = [
    {'column': 'Sex', 'values': ['F']},
    {'column': 'Year', 'values': ['2012']}
]
print_deaths(deaths_per_years(rows, criterias), criterias)
criterias = [
    {'column': 'Year', 'values': ['2012']},
    {'column': 'Sex', 'values': ['M']}
]
print_deaths(deaths_per_years(rows, columns_matching_values = None ), None)
races = unique_values_in_column(rows, 'Race Ethnicity')
print(races)
count = count_deaths_where_column_is(rows, 'Race Ethnicity', 'Not Stated/Unknown')
print(count)
criterias = [
    {'column': 'Year', 'values': ['2012']},
    {'column': 'Sex', 'values': ['M'] },
    {'column': 'Race Ethnicity', 'values': ['Not Stated/Unknown'] }
]
print_deaths(deaths_per_years(rows, criterias), criterias)



causes_of_death = unique_values_in_column(rows, 'Leading Cause')
largest_death_count = 0
largest_cause = '?'
for cause in causes_of_death:
    criterias = [
        {'column': 'Leading Cause', 'values': [cause]},
        {'column': 'Year', 'values': ['2009']}
    ]
    death_count = deaths_per_years(rows, criterias)
    if death_count > largest_death_count:
        largest_cause = cause
        largest_death_count = death_count

print(f"The largest cause of death in 2009 was {largest_cause} with {largest_death_count}")

criteria = []
done = False
while not done:
    print(f"Current criteria: {criteria}")
    user_input = input("Reset criteria? [Y/N] ")
    if user_input not in ['Y', 'N', 'y', 'n']:
        print("You must enter Y or N")
        continue

    if user_input in ['Y', 'y']:
        criteria = []

    year_input = input("Would you like to filter by a year? [Y/N] ")
    if year_input in ['Y', 'y']:
        year = input("Enter the year and press enter: ")
        criteria.append({ 'column': 'Year', 'values': [year] })

    sex_input = input("Would you like to filter by sex? [Y/N] ")
    if sex_input in ['Y', 'y']:
        sex = input("Enter the sex [M/F] to filter by and press enter: ")
        if sex not in ['M', 'F']:
            print("invalid sex input!")
            continue

        criteria.append({ 'column': 'Sex', 'values': [sex] })

    race_input = input("Would you like to filter by race? [Y/N] ")
    if race_input in ['Y', 'y']:
        race = input("Enter the race to filter by and press enter: ")
        if race not in unique_values_in_column(rows, 'Race Ethnicity'):
            values = unique_values_in_column(rows, 'Race Ethnicity')
            print(f"Invalid value! possible values are {values}")
            continue
        
        criteria.append({ 'column': 'Race Ethnicity', 'values': [race] })


    print("Searching...")

    death_count = deaths_per_years(rows, criteria)
    print_deaths(death_count, criteria)

    are_we_done = input("Would you like to quit? [Y/N]: ")
    if are_we_done in ['Y', 'y']:
        break
