# here is where I would like a block image to go for each lesson. Imagery is important to engage the learner.

“Ugh, why do we have to write this in that language?”

Whether you’re groaning over having to translate your code into another language or arguing at length over the benefits of some obscure package, you’ve likely considered the differences between programming languages before. Perhaps you switch languages so often that your semicolons get mixed up with your whitespace, and you spend far too long searching for the switch statement that doesn’t exist in Python. Perhaps you’re committed to your language and refuse to touch anything else, because yours is clearly superior. Whatever the case, we are going to explore some differences between languages today.

What does it mean to be “different”?

One of the most common uses of statistics is comparing samples in order to detect a significant difference. This might mean detecting a difference in their averages or, more formally, detecting that they are drawn from truly different generating processes in the world. More on that: whenever we draw a sample, we are trying to capture some phenomenon as it happens; there is some process (that we can’t quite know) that dictates how the data turns out. In this example, it could be that there is something inherent about Python vs. JavaScript interpreters that makes them actually different. Or perhaps Python users are different from JavaScript users, with one group tending to be more verbose than the other. We draw samples in order to capture patterns in the world, and those patterns come from some causal relationship. With comparisons alone, we cannot determine causality, but keeping these processes in mind gets us thinking about the “why” when we do see a difference after all. If we see a difference in the average number of lines for Python vs. JavaScript, is there something inherent about writing Python code that lends itself nicely to one-liners? Or is it the population of Python users? Or is it the specific Python code we grabbed? These are all questions we need to think about when we hear someone say “Oh please, use this language instead! It’s way better at… blah blah blah.” So let’s get to the bottom of those “blah blah blah” claims, shall we?

How does the number of lines relate to the number of function definitions?

You’re faced with our question: How does the number of lines of code compare with the number of function definitions between languages? Immediately, I think about the many times I’ve seen code that copies and pastes the same block over and over without declaring it as a function. (Remember your coding etiquette, people! If you copy and paste 3 times, it’s time for a function.) But I’m also reminded of the massive code libraries that contain a function for everything, with hundreds of lines of code for all of those functions. Try to reflect on your own intuition about whether the number of lines is related to the number of function declarations in a program. Maybe take a look at a program you’ve written: How many lines? How many function declarations? And did that change over the duration of your first programming course? (I personally went from copy-and-pasting a hard-coded mess to writing neat little helper functions once I had to face my own monstrous and unmaintainable code.) But wouldn’t that shorten my program length? You can imagine that the larger the program, the more function definitions there are. But you can also imagine just the opposite.

How do I parse a programming language? Abstract Syntax Trees

We need to parse a program in order to count up the number of function definitions. The most straightforward way of doing this might be to look for the “def” keyword in a language like Python. Imagine you were writing this yourself, trying to count the number of function definitions in a piece of code. You might rely on good old-fashioned string matching:

for line in program:
    if "def" in line:
        function_count += 1

That’s a fine start, but it gets very messy, fast: the check also matches a comment that mentions “def” or a variable named default. Luckily, we can rely on something called an Abstract Syntax Tree. In this example, we use the ast package for Python and acorn.js for JavaScript. TODO: lobstr for R

These parsers give us easy access to the different components of a program and help us count them up more easily. The example scripts are located here (Python) and here (JavaScript).
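To make the contrast concrete, here is a minimal sketch of counting function definitions with Python’s built-in ast module, next to the naive string match. It is not the lesson’s actual example script; the helper name count_function_defs and the sample program are made up for illustration.

```python
import ast


def count_function_defs(source: str) -> int:
    """Count `def` and `async def` statements in Python source (hypothetical helper)."""
    tree = ast.parse(source)
    # ast.walk visits every node in the tree, including nested functions
    return sum(
        isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        for node in ast.walk(tree)
    )


# Made-up sample program: two real definitions plus one misleading line
sample = '''
default_name = "def"  # string matching would count this line

def greet(name):
    def shout(text):
        return text.upper()
    return shout("hello " + name)
'''

# Naive approach from the text: any line containing "def" is counted
naive_count = sum("def" in line for line in sample.splitlines())
ast_count = count_function_defs(sample)

print(naive_count, ast_count)  # prints: 3 2
```

The string match is fooled by the default_name line, while the tree-based count sees only the two real definitions. Note that this sketch does not count lambda expressions; whether those should count as function definitions is a choice you would have to make for your own analysis.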

Load Your Libraries

library(ggplot2)
library(dplyr)

Read in Your Data

py = read.csv("../data/PythonFiles/Stats/python_stats.csv")
colnames(py) <- c("Lines","FunctionDefs")
length(py$Lines)
## [1] 238
# 238 Python programs
py$Language = "Python"

# JS programs
js = read.csv("../data/JSFiles/Stats/js_stats.csv")
colnames(js) <- c("Lines","FunctionDefs")
js$Language= "JavaScript"

data = rbind(py,js)
# some programs must not have been finished yet from the repository of programs?
#data = data[data$Lines>0,]
#data = data[data$FunctionDefs>0,]

# we have a few outliers in our data that skew things
out_js.lines <- boxplot(data$Lines[data$Language=="JavaScript"], plot=FALSE)$out
out_js.funcs <- boxplot(data$FunctionDefs[data$Language=="JavaScript"], plot=FALSE)$out
out_py.lines <- boxplot(data$Lines[data$Language=="Python"], plot=FALSE)$out
out_py.funcs <- boxplot(data$FunctionDefs[data$Language=="Python"], plot=FALSE)$out


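# Build one logical mask over the full data frame, so each outlier test is
# applied only to rows of its own language. (Indexing with -which() on a
# per-language subset points at the wrong rows, and drops every row when
# which() finds no matches.)
is_js <- data$Language == "JavaScript"
is_py <- data$Language == "Python"
is_outlier <- (is_js & (data$Lines %in% out_js.lines | data$FunctionDefs %in% out_js.funcs)) |
  (is_py & (data$Lines %in% out_py.lines | data$FunctionDefs %in% out_py.funcs))
data <- data[!is_outlier, ]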

Don’t Compare Apples to Oranges

You may notice when we try to compare the languages that it’s very hard to see what’s going on with the Python programs as opposed to the JavaScript programs. That’s because we have entirely different scales: the JavaScript AST traces all of the callback functions, resulting in a mean of 55 functions per program, while the Python programs average 3 functions per program. It could also be that our samples are simply not as comparable as we thought. The JavaScript files are much longer than the Python programs (JS mean lines = 2646, Python mean lines = 69). Not only are they difficult to visualize side by side, but comparing them directly would also be an actual statistical issue! Our samples are just too different. But we are not entirely out of luck. We get to use normalization to make our apples (JS) and oranges (Python) more comparable.

plt = ggplot(data,aes(Lines,FunctionDefs))+
  facet_wrap(~Language)+
  geom_point(aes(colour=Language),size=1, show.legend=FALSE)+
  #geom_vline(xintercept=mean(data$Lines),colour="red",linetype="dashed")+
  #geom_hline(yintercept=mean(data$FunctionDefs),colour="red",linetype="dashed")+
  ggtitle("Lines v. Function Definitions")+
  xlab("Number of Lines (raw)")+
  ylab("Number of Function Definitions (raw)")+
  theme_bw()
  
plt
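As a rough illustration of what normalization buys us, here is a small sketch that rescales each language’s line counts to z-scores (mean 0, standard deviation 1). Z-scoring is only one common option and is an assumption here, not necessarily the method this lesson goes on to use. The sketch is written in Python rather than R, and the numbers are made-up toy values, not our data.

```python
from statistics import mean, stdev


def z_scores(values):
    """Rescale values to mean 0 and standard deviation 1 (hypothetical helper)."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    return [(v - mu) / sigma for v in values]


# Made-up toy line counts, chosen only to mimic two very different scales
python_lines = [40, 60, 70, 110]
js_lines = [1500, 2400, 2800, 3900]

# Normalize within each language so both end up on the same unitless scale
py_z = z_scores(python_lines)
js_z = z_scores(js_lines)

print([round(z, 2) for z in py_z])
print([round(z, 2) for z in js_z])
```

After rescaling, a value says how far a program sits from its own language’s average, which is what lets the two groups share one axis.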