 2019 ASM / R-0052 A practical step-by step guide for radiologists on machine learning.
Congress: 2019 ASM
Poster No.: R-0052
Type: Educational Exhibit
Keywords: Education and training, Computer Applications-General, Neural networks, Artificial Intelligence
Authors: D. T. Wang¹, S. S. Wang², L. L. Wang³; ¹Ohio/US, ²Utah/US, ³Cincinnati/US
DOI: 10.26044/ranzcr2019/R-0052
DOI-Link: http://dx.doi.org/10.26044/ranzcr2019/R-0052

# Background

Machine learning is not a new concept; it dates back to the 1950s. Interest in it, and the number of its applications, have however grown rapidly over the past 10 years, partly because of the increased computer processing power available. This growth is not confined to computer science: it has spilled over into radiology, as shown by the recent creation of the journal Radiology: Artificial Intelligence. While many reference articles are available, a step-by-step how-to guide is lacking.

Machine learning builds on the principle of regression. As a simple example, imagine trying to predict a patient's length of hospital stay from their number of co-morbidities. The number of co-morbidities is your feature (x), and it is plotted against the length of hospital stay, the output (y). To predict a patient's length of stay, you fit a line to the data (Fig. 1) and call it your hypothesis function, h(x). Each data point lies some distance from the line, and the line of best fit is the one that minimises the sum of these distances (most commonly, the sum of their squares). This is linear regression, and it can also be represented as in Fig. 2.

The red line in Fig. 1 represents h(x), where

$$h(x) = w_0 + w \cdot x$$

Optimising $w$ and $w_0$ gives the line of best fit. $w$ is known as the weight and sets the slope of the line; $w_0$ sets its y-intercept. The set containing your features, x, and outputs, y, is termed a training set.
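
As a rough illustration of how the weight and intercept can be optimised, the sketch below fits h(x) in plain Python. It assumes the common least-squares criterion (minimising the sum of squared vertical distances), which the text does not spell out. The names `fit_line` and `h` and the five data points are made up for illustration and do not come from the exhibit.

```python
def fit_line(xs, ys):
    """Return (w0, w) for the least-squares line h(x) = w0 + w * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    w0 = mean_y - w * mean_x
    return w0, w


def h(x, w0, w):
    """Hypothesis function: predicted output for feature value x."""
    return w0 + w * x


# Hypothetical training set: number of co-morbidities vs length of stay (days).
comorbidities = [0, 1, 2, 3, 4]
length_of_stay = [2.0, 3.0, 5.0, 6.0, 9.0]

w0, w = fit_line(comorbidities, length_of_stay)
predicted = h(5, w0, w)  # predicted stay for a patient with 5 co-morbidities
print(w0, w, predicted)
```

In practice a library routine would normally be used instead; the closed-form version is shown only because it makes the roles of the weight and the intercept explicit.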

Although a straight line may seem like a good fit, the data may be better predicted by a quadratic (Fig. 3). Here the distance from each data point to the curve is smaller than for the linear fit. The input values, or features, are now $x$ and $x^2$, and this can be represented in the form shown in Fig. 4, where we denote $x$ as $x_1$ and $x^2$ as $x_2$.

Our hypothesis then becomes:

$$h(X) = w_0 + w_1 \cdot x_1 + w_2 \cdot x_2$$

where $X$ is the set of features $[x_1, x_2]$.
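
To make the feature construction concrete, the sketch below builds the features for a quadratic fit, where the second feature is the square of the first, and evaluates the hypothesis for a given set of weights. The function names and the weight values are hypothetical, chosen only so the arithmetic is easy to follow; they are not fitted to any data.

```python
def quadratic_features(x):
    """Map a single input x to the feature list X = [x1, x2] = [x, x squared]."""
    return [x, x ** 2]


def h(X, w0, weights):
    """Hypothesis h(X) = w0 + w1*x1 + w2*x2 + ... for any number of features."""
    return w0 + sum(w * x for w, x in zip(weights, X))


# Hypothetical weights (not learned from data).
w0 = 2.0
weights = [0.5, 0.25]

X = quadratic_features(4)        # [4, 16]
prediction = h(X, w0, weights)   # 2.0 + 0.5*4 + 0.25*16
print(X, prediction)
```

Because `h` simply sums over whatever features it is given, the same function would also cover a hypothesis with more than two features.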

Instead of using only the number of co-morbidities, we may incorporate other clinical variables, such as age and sex, to improve our prediction (Fig. 5). This means increasing the number of features: $x_1, x_2, x_3, \dots, x_n$. This is multiple (multivariable) regression, often loosely called multivariate regression.

Instead of predicting a continuous variable, we can also predict a categorical variable, with the output being the probability of that category. Rather than using the weighted sum of the features directly, we pass it through a sigmoid function, g, which for an input X is defined as

$$g(X) = \frac{1}{1 + e^{-X}}$$

The function g(X) is shown in Fig. 6. Applying it to the weighted sum of the features gives logistic regression, and the hypothesis becomes:

$$h(X) = g(w_0 + w_1 \cdot x_1 + w_2 \cdot x_2 + \dots + w_n \cdot x_n)$$
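
The logistic hypothesis can be sketched in a few lines of Python. The sigmoid `g` follows the formula given above; the feature values and weights in the example are made up and not learned from data, so the printed probability has no clinical meaning.

```python
import math


def g(z):
    """Sigmoid function: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def h(X, w0, weights):
    """Logistic hypothesis: sigmoid of the weighted sum of the features."""
    return g(w0 + sum(w * x for w, x in zip(weights, X)))


# Hypothetical features: [number of co-morbidities, age in years].
X = [3, 70]
# Hypothetical weights (not learned from data).
w0 = -4.0
weights = [0.5, 0.03]

probability = h(X, w0, weights)
print(probability)
```

The output always lies strictly between 0 and 1, which is what allows it to be read as a probability; an input of 0 to the sigmoid gives exactly 0.5.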

One application of machine learning in radiology is classifying an image to obtain a diagnosis. For example, a chest x-ray may be interpreted as normal, pneumonia, pulmonary oedema, etc. The input (X) is the individual pixel values of the chest x-ray, and the output (Y) for a classification problem is the set of entities: normal, pneumonia, pulmonary oedema, etc. This is represented in Fig. 7.

Each of the circles represents a node. The value of each node in the subsequent layer is the sum of the nodes in the previous layer, each multiplied by its weight (w), which is then passed through the activation function, g(x).
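
The node computation described above might be sketched as follows, assuming the sigmoid as the activation function and an intercept term per node that plays the role of w0 in the earlier hypotheses. The name `next_layer`, the three "pixel" values and all the weights are hypothetical and chosen only for illustration.

```python
import math


def g(z):
    """Sigmoid activation function."""
    return 1.0 / (1.0 + math.exp(-z))


def next_layer(previous, weights, intercepts):
    """Compute every node of the next layer from the previous layer.

    weights[j][i] is the weight from node i of the previous layer to
    node j of the next layer; intercepts[j] is that node's w0 term.
    """
    return [
        g(w0 + sum(w * a for w, a in zip(row, previous)))
        for row, w0 in zip(weights, intercepts)
    ]


# Hypothetical input layer: three pixel values scaled to the range 0 to 1.
pixels = [0.0, 0.5, 1.0]
# Hypothetical weights for two output nodes (not learned from data).
weights = [[0.2, -0.4, 0.1],
           [0.0, 0.0, 0.0]]
intercepts = [0.1, 0.0]

outputs = next_layer(pixels, weights, intercepts)
print(outputs)
```

A real chest x-ray would contribute one input node per pixel, so the weight table would have one column per pixel and one row per output node.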

Unfortunately, using the network in Fig. 7 to classify our images would not give very accurate predictions. We can include a layer between our input and output (Fig. 8), called a hidden layer. The hidden layer can have a different number of nodes from the input layer, and there can be more than one hidden layer (Fig. 9). By optimising the weights between the nodes, we train our machine learning algorithm. The activation function, g(x), can also be varied; examples include the rectified linear unit (ReLU) and the softmax function, the details of which are beyond the scope of this exhibit. Nodes and activation functions form the basis of neural networks.
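
For readers who want to see how the pieces fit together, the sketch below passes an input through one hidden layer to an output layer. It assumes ReLU on the hidden layer and softmax on the output layer, using their standard definitions, which the exhibit itself does not give. The layer sizes, weights and pixel values are all made up and untrained, so the resulting probabilities are meaningless; only the mechanics are of interest.

```python
import math


def relu(values):
    """Rectified linear unit: negative values become 0, others are unchanged."""
    return [max(0.0, v) for v in values]


def softmax(values):
    """Convert a list of numbers into probabilities that sum to 1."""
    shift = max(values)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sums(previous, weights, intercepts):
    """Weighted sum of the previous layer for each node in the next layer."""
    return [
        w0 + sum(w * a for w, a in zip(row, previous))
        for row, w0 in zip(weights, intercepts)
    ]


def forward(pixels, hidden_w, hidden_b, output_w, output_b):
    """Input layer -> hidden layer (ReLU) -> output layer (softmax)."""
    hidden = relu(weighted_sums(pixels, hidden_w, hidden_b))
    return softmax(weighted_sums(hidden, output_w, output_b))


# Hypothetical network: 4 input pixels, 2 hidden nodes, 3 output classes.
pixels = [0.1, 0.8, 0.3, 0.5]
hidden_w = [[0.2, -0.1, 0.4, 0.0],
            [-0.3, 0.5, 0.1, 0.2]]
hidden_b = [0.0, 0.1]
output_w = [[0.6, -0.2],
            [0.1, 0.3],
            [-0.4, 0.5]]
output_b = [0.0, 0.0, 0.0]
labels = ["normal", "pneumonia", "pulmonary oedema"]

probabilities = forward(pixels, hidden_w, hidden_b, output_w, output_b)
for label, p in zip(labels, probabilities):
    print(label, round(p, 3))
```

Training would consist of adjusting `hidden_w`, `hidden_b`, `output_w` and `output_b` until the predicted probabilities match the known diagnoses in a training set; that step is not shown here.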