Posterior distribution of logistic regression coefficient

I have a binary logistic regression with the following properties

Consider the logistic regression for binary data $Y_i \in \{0,1\}$ and the covariate vector $x_i = (x_{i,1}, x_{i,2}, \ldots, x_{i,p})$. Under the logistic regression assumption, the sampling distribution of $Y_i$ is (1). We assume a normal prior for $\beta_j$ for $j = 1, \ldots, p$ as in (2), and the $\beta_j$'s are independent of each other. The $(\mu_j, \sigma_j^2)$ are prior parameters and need to be specified.

$$\Pr(Y_i = 1 \mid \beta) = \frac{\exp(x_i^\top \beta)}{1 + \exp(x_i^\top \beta)} \tag{1}$$

$$\beta_j \sim N(\mu_j, \sigma_j^2) \tag{2}$$

I have to find the posterior distribution for $\beta$, i.e., $p(\beta \mid y, \mu_1, \ldots, \mu_p, \sigma_1^2, \ldots, \sigma_p^2)$.
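By Bayes' rule, this posterior should be proportional to the likelihood implied by (1) times the independent normal priors in (2):

$$p(\beta \mid y, \mu_1, \ldots, \mu_p, \sigma_1^2, \ldots, \sigma_p^2) \;\propto\; \prod_{i=1}^{n} \frac{\exp(x_i^\top \beta)^{\,y_i}}{1 + \exp(x_i^\top \beta)} \;\times\; \prod_{j=1}^{p} \exp\!\left(-\frac{(\beta_j - \mu_j)^2}{2\sigma_j^2}\right),$$

but this does not appear to reduce to any standard family, and I do not see how to simplify it further.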

How do you calculate the training error and validation error of a linear regression model?

I have a linear regression model that I’ve implemented using Gradient Descent and my cost function is a Mean Squared Error function. I’ve split my full dataset into three datasets, a training set, a validation set, and a testing set. I am not sure how to calculate the training error and validation error (and the difference between the two).

Is the training error the Residual Sum of Squares error calculated using the training dataset? Is the validation error the Residual Sum of Squares error calculated using the validation dataset? And what exactly is the test set for? (I've learned the model using the training set; from the textbooks I've read, I think that is the set to use to learn the model.)
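For concreteness, a minimal sketch of computing both quantities as plain MSE (the predict function and the parameters below are hypothetical stand-ins for whatever gradient descent produced):

    import numpy as np

    def mse(y_true, y_pred):
        # Mean squared error over one data split.
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        return np.mean((y_true - y_pred) ** 2)

    def predict(X, weights, bias):
        # Linear model using the parameters found by gradient descent.
        return X @ weights + bias

    rng = np.random.default_rng(0)
    X_train, y_train = rng.normal(size=(60, 3)), rng.normal(size=60)
    X_val, y_val = rng.normal(size=(20, 3)), rng.normal(size=20)
    weights, bias = rng.normal(size=3), 0.0   # stand-ins for the fitted parameters

    train_error = mse(y_train, predict(X_train, weights, bias))  # error on the data used for fitting
    val_error = mse(y_val, predict(X_val, weights, bias))        # error on held-out validation data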

Any help in clearing up these points is much appreciated.

Comparison of feature importance values in logistic regression and random forest in scikit-learn [closed]

I am trying to rank the features for binary classification, based on their importance, using an ensemble method that combines the feature importances estimated by random forest and logistic regression. I know that logistic regression coefficients and random forest feature_importances_ are different kinds of values, and I'm looking for a method to make them comparable. Here is what I have in mind:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression

    X = features
    y = labels
    rf = RandomForestClassifier()
    rf.fit(X, y)
    RFfitIMP = rf.feature_importances_ / rf.feature_importances_.sum()  # normalizing feature importances to sum up to 1
    lr = LogisticRegression()
    lr.fit(X, y)
    lrfitIMP = np.absolute(lr.coef_) / np.absolute(lr.coef_).sum()  # taking absolute values and normalizing coefficient values to sum up to 1
    ensembleFitIMP = np.mean([featIMPs for featIMPs in [RFfitIMP, lrfitIMP]], axis=0)

What I think the code does is take the relative importances from both models, normalize them, and return the importance of each feature averaged over the two models. I was wondering whether this is a correct approach for the purpose or not?

Scikit-learn regression on power set of data

How do I run linear regression on every subset of a dataframe in a loop with scikit-learn's LinearRegression?

    def sub_lists(list1):
        sublist = [[]]
        for i in range(len(list1) + 1):
            for j in range(i + 1, len(list1) + 1):
                sub = list1[i:j]
                sublist.append(sub)
        return sublist

    X = sub_lists(df5); y = df4

I ran regression on this, however it keeps throwing an error; the data is a .dta (Stata) file.
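Not certain this matches what is being attempted, but a sketch of looping scikit-learn's LinearRegression over every non-empty subset of columns (using itertools.combinations; the names df5 and df4 are taken from the question) might look like:

    from itertools import combinations
    from sklearn.linear_model import LinearRegression

    def fit_all_subsets(df5, df4):
        # Fit one linear regression per non-empty subset of predictor columns.
        results = {}
        cols = list(df5.columns)
        for r in range(1, len(cols) + 1):
            for subset in combinations(cols, r):
                model = LinearRegression().fit(df5[list(subset)], df4)
                results[subset] = model.score(df5[list(subset)], df4)  # in-sample R^2
        return results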

Stepwise regression

I think that both forward selection and backward selection should give the same results if the evaluation model is deterministic and using the same variables gives the same results. Is this true? If so, what are the reasons for choosing one method over the other?

What is a good algorithm for generating a linear regression of positions in 3D space? (for getting the direction of a thrown object in VR)

I’m trying to get throwing to feel right in my VR game. I don’t plan on actually using physics to do this; my idea is to accurately determine the lateral direction of the throw, then move the object in the intended direction (with a doctored Y value) at a speed dependent on other factors.

The language of the code doesn’t matter too much, but I’m using Godot, so something that wouldn’t require me to import a bunch of different math libraries (by converting them to Godot’s Python-style GDScript) would be ideal. It doesn’t need to be robust.

To give you an idea of what I was thinking, my original plan was to save the position of the grabbing controller every frame while an object is being held, removing old positions if the controller is moving backwards, then averaging out the movement deltas over the last 10 or 15 frames or so before letting go (it should be consistent since Godot has a fixed-timestep “update” function available) and normalizing the vector to get the direction the object should travel in.
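Roughly, that delta-averaging plan might look like the following plain-Python sketch (frame_positions is a hypothetical list of per-physics-frame controller positions; in practice this would be ported to GDScript):

    def throw_direction_from_deltas(frame_positions, window=10):
        # Average the movement deltas over the last `window` frames and normalize.
        recent = frame_positions[-(window + 1):]
        if len(recent) < 2:
            return [0.0, 0.0, 0.0]
        deltas = [(b[0] - a[0], b[1] - a[1], b[2] - a[2])
                  for a, b in zip(recent, recent[1:])]
        avg = [sum(component) / len(deltas) for component in zip(*deltas)]
        norm = sum(c * c for c in avg) ** 0.5
        return [c / norm for c in avg] if norm > 0 else [0.0, 0.0, 0.0]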

However, this blog post got me thinking that spending the time learning how to do the linear regression in a more robust way, then applying that to 3D/VR, might be worthwhile. I just wonder if, since I don’t need the actual linear or angular velocity, it might be overkill in terms of time spent on the feature.
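For comparison, the more robust route would be a least-squares line fit through the sampled positions; one common way is to take the first principal direction of the centred points. A NumPy sketch (purely illustrative; I have not checked it against the blog post's approach):

    import numpy as np

    def throw_direction_least_squares(positions):
        # positions: (N, 3) array of recent controller positions, oldest first.
        P = np.asarray(positions, dtype=float)
        centred = P - P.mean(axis=0)
        # First right singular vector = direction of the best-fit 3D line.
        _, _, vt = np.linalg.svd(centred, full_matrices=False)
        direction = vt[0]
        # Flip so the direction points from the oldest sample toward the newest.
        if np.dot(direction, P[-1] - P[0]) < 0:
            direction = -direction
        return direction / np.linalg.norm(direction)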

Regression DNN: gradient checking doesn't match backpropagation derivatives

I think the problem lies in a bad implementation of backpropagation; it's the only way I can explain this (gradient checking that doesn't match the backprop derivatives). But I'm not able to find any error, and the derivatives look fine to me, so I'm asking whether the derivatives I compute make sense, or whether there is some error.

About DNN:

I built a 3-layer (1 input, 1 hidden, 1 output) neural network. My goal was regression, so the last layer has 1 neuron; I used leaky ReLU as the activation function on the hidden layer, and no activation function on the output layer. I used Mean Squared Error (MSE) as the cost function. I also used normalization, and I haven't used regularization yet.
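As a reference for the notation used below, a NumPy sketch of the forward pass this describes (assuming each column of X is one sample, which is an assumption on my part):

    import numpy as np

    def leaky_relu(z, alpha=0.01):
        return np.where(z > 0, z, alpha * z)

    def forward(X, W_hidden, b_hidden, W_out, b_out):
        Z_hidden = W_hidden @ X + b_hidden   # ZLayerHidden
        A_hidden = leaky_relu(Z_hidden)      # ALayerHidden
        Z_out = W_out @ A_hidden + b_out     # ZLayerOutput
        A_out = Z_out                        # ALayerOutput (no activation on the output layer)
        return Z_hidden, A_hidden, Z_out, A_out

    def mse_cost(A_out, y):
        m = y.shape[-1]                      # number of samples
        return np.sum((A_out - y) ** 2) / m  # MSE cost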

About derivatives:

dCost/dALayerOutput = d( 1/m * sum( (ALayerOutput - y)^2 ) )/dALayerOutput = 2/m * (ALayerOutput - y) = dA

dCost/dZLayerOutput = dA * dALayerOutput/dZLayerOutput = dA * 1 = dZ (because I don't apply any activation function to the last layer).

dCost/dWeightOutput = dZ * dZLayerOutput/dWeightOutput = dZ * d(WeightOutput * ALayerHidden + BiasOutput)/dWeightOutput = dZ * ALayerHidden = dW

dCost/dBiasOutput = dZ * dZLayerOutput/dBiasOutput = dZ * d(WeightOutput * ALayerHidden + BiasOutput)/dBiasOutput = dZ * 1 = dZ = dB

dCost/dALayerHidden = dZ * dZLayerOutput/dALayerHidden = dZ * d(WeightOutput * ALayerHidden + BiasOutput)/dALayerHidden = dZ * WeightOutput = dA-1

dCost/dZLayerHidden = dA-1 * dALayerHidden/dZLayerHidden = dA-1 * d( leakyRelu(ZLayerHidden) )/dZLayerHidden = dA-1 * dLeakyRelu(ZLayerHidden) = dZ-1

dCost/dWeightHidden = dZ-1 * dZLayerHidden/dWeightHidden = dZ-1 * d(WeightHidden * LayerInput + BiasHidden)/dWeightHidden = dZ-1 * LayerInput = dW-1

dCost/dBiasHidden = dZ-1 * dZLayerHidden/dBiasHidden = dZ-1 * d(WeightHidden * LayerInput + BiasHidden)/dBiasHidden = dZ-1 * 1 = dB-1

About Gradient Descent:

netWeightsLayerOutput = netWeightsLayerOutput - (learningRate * dW)

netWeightsLayerHidden = netWeightsLayerHidden - (learningRate * dW-1)

netBiasesLayerOutput = netBiasesLayerOutput - (learningRate * dB)

netBiasesLayerHidden = netBiasesLayerHidden - (learningRate * dB-1)


Did you find any error?
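For reference, a minimal sketch of a standard central-difference gradient check (cost here is a hypothetical function returning the MSE for a flat parameter vector):

    import numpy as np

    def numerical_gradient(cost, params, eps=1e-7):
        # Central-difference estimate of dCost/dparams, one parameter at a time.
        grad = np.zeros_like(params)
        for k in range(params.size):
            bump = np.zeros_like(params)
            bump[k] = eps
            grad[k] = (cost(params + bump) - cost(params - bump)) / (2 * eps)
        return grad

    # Compare with the backprop gradient via a relative difference:
    # rel_diff = np.linalg.norm(g_num - g_bp) / (np.linalg.norm(g_num) + np.linalg.norm(g_bp))
    # Values around 1e-7 or smaller suggest the derivatives agree.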

Bayesian Beta regression Model — Error in jags: Invalid parent value

I'm trying to run a Bayesian pooled model in JAGS through R and I'm getting an error message.

I found, from people who have encountered similar problems, that it could be triggered by the values of the priors, negative values, the log of a negative number, syntax errors, etc. I have eliminated all of these, but the error persists.

    ## just for the prediction
    pred.jac <- seq(min(test.bayes$Latitude), max(test.bayes$Latitude), 10)

    data = list(
      jac = test.bayes$Jaccard,
      lat = test.bayes$Latitude,
      pred.jac = pred.jac)

    inits = list(
      list(alpha = 1, beta = 2.5, sigma = 50),
      list(alpha = 2, beta = 1.5, sigma = 20),
      list(alpha = 3, beta = 0.75, sigma = 10))

    {
      sink("BetaPooledJAGS.R")
      cat("
          model{

          # priors
          alpha ~ dnorm(0, 0.0001)
          beta ~ dnorm(0, 0.0001)
          sigma ~ dunif(0, 10)

          # likelihood
          for (i in 1:length(jac)) {
          mu[i] <- alpha + beta * lat[i]
          a[i] <- ((1 - mu[i]) / (sigma^2) - 1 / mu[i]) * mu[i]^2
          b[i] <- alpha * (1 / mu[i] - 1)
          jac[i] ~ dbeta(a[i], b[i])
          }

          # predicted jaccard as derived quantities
          for (i in 1:length(pred.jac)) {
          mu_pred[i] <- alpha + beta * lat[i]
          mu_pred1[i] <- exp(mu_pred[i])
          }

          }
          ", fill = TRUE)
      sink()
    }

    n.adapt = 3000
    n.update = 5000
    n.iter = 5000

    jm.pooled = jags.model(file = "BetaPooledJAGS.R", data = data, n.adapt = n.adapt,
                           inits = inits, n.chains = length(inits))

When I run the code, I get the error below:

Error in jags.model(file = "BetaPooledJAGS.R", data = data, n.adapt = n.adapt, : Error in node jac[1] Invalid parent values

Here’s the link to a subset of my data.

https://fil.email/IuwgYhKs

Correlation not found (linear regression) problem

I am new to machine learning and I am trying to develop my knowledge and skills with projects. While doing so, I encountered a problem where I couldn't find a variable that is well correlated with the target; the highest correlation coefficient I found was 0.44. So I made a scatter plot to see how the two variables behave, in order to choose between a polynomial regression model and a linear regression model, and it turned out like this (see the scatter plot). I am clueless about what to do.

Manual Regression Tree using Python

I wrote some code to create a regression tree for synthetic training data of size Np. The idea is: first I have the source node (which consists of the whole set of points), represented as a dictionary {'points': ..., 'avg': ..., 'left_node': ..., 'right_node': ..., 'split_point': ...}. The left and right nodes are the leaves after the splitting process of the whole data (source). split_point holds information about the best split. Then I loop to grow a deeper tree up to a maximum number of nodes specified beforehand; I also require that a node have more than 5 points before it can be split.

This way, if I want to predict a point (x', y'), I can just start from the source node source, check which region the point lies in (left_node or right_node), and then continue down the tree, because all left_node and right_node values have the same structure as source.
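A minimal sketch of that lookup, assuming the node dictionaries are built exactly as described (leaf nodes have no 'split_point' key):

    def predict(node, point):
        # Walk down from the source node until no further split exists.
        while node.get('split_point') is not None and node.get('left_node') is not None:
            axis, threshold = node['split_point']           # ('x', value) or ('y', value)
            coord = point[0] if axis == 'x' else point[1]
            node = node['left_node'] if coord <= threshold else node['right_node']
        return node['avg']                                   # predicted value for that region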

Also, the form function is used to find the best split; the best split is the one with the smallest form(reg_1, avg1, reg_2, avg2). This is a greedy algorithm for finding the best split.


I would like to know better ways to perform this without external modules, since it is intended to be taught to high school students.


Full code:

    import math
    import random
    import matplotlib.pyplot as plt


    def form(region_1, av1, region_2, av2):
        return sum([(i[1]-av1)**2 for i in region_1]) \
               + sum([(i[1]-av2)**2 for i in region_2])

    Np = 400
    x_data = [abs(random.gauss(5, 0.2) + random.gauss(8, 0.5)) for i in range(Np)]
    y_data = [abs(random.gauss(10, 0.2) + random.uniform(0, 10)) for i in range(Np)]
    value = [abs(random.gauss(4, 0.5)) for i in range(Np)]

    data = [((i, j), k) for i, j, k in zip(x_data, y_data, value)]

    fig, ax = plt.subplots()
    ax.plot(x_data, y_data, 'o')
    fig.show()


    ###### Splitting from the source node (all data)

    source = {'points': data, 'avg': sum([i[1] for i in data])/Np,
              'split_point': None, 'left_node': None, 'right_node': None}
    forms = []

    for x in x_data:
        var = x
        region_1 = [j for j in data if j[0][0] <= var]
        region_2 = [j for j in data if j not in region_1]

        if len(region_1) > 0 and len(region_2) > 0:
            av1 = sum([i[1] for i in region_1])/len(region_1)
            av2 = sum([i[1] for i in region_2])/len(region_2)

            f = form(region_1, av1, region_2, av2)
            leaf_1 = {'points': region_1, 'avg': av1}
            leaf_2 = {'points': region_2, 'avg': av2}
            forms.append((leaf_1, leaf_2, ('x', var), f))

    for y in y_data:
        var = y
        region_1 = [j for j in data if j[0][1] <= var]
        region_2 = [j for j in data if j not in region_1]

        if len(region_1) > 0 and len(region_2) > 0:
            av1 = sum([i[1] for i in region_1])/len(region_1)
            av2 = sum([i[1] for i in region_2])/len(region_2)

            f = form(region_1, av1, region_2, av2)
            leaf_1 = {'points': region_1, 'avg': av1}
            leaf_2 = {'points': region_2, 'avg': av2}
            forms.append((leaf_1, leaf_2, ('y', var), f))

    sorted_f = sorted(forms, key=lambda x: x[3])
    best_split = sorted_f[0]
    source['split_point'] = best_split[2]
    source['left_node'] = best_split[0]
    source['right_node'] = best_split[1]


    ##### Splitting from the 2 leafs and so on..

    leafs = [source['left_node'], source['right_node']]
    all_nodes = [leafs[0], leafs[1]]

    max_nodes = 1000

    while len(all_nodes) <= max_nodes:
        next_leafs = []
        for leaf in leafs:
            if len(leaf['points']) > 5:
                xx = [i[0][0] for i in leaf['points']]
                yy = [i[0][1] for i in leaf['points']]
                rr = [i[1] for i in leaf['points']]
                vv = [((i, j), k) for i, j, k in zip(xx, yy, rr)]
                forms = []

                for x in xx:
                    var = x
                    region_1 = [j for j in vv if j[0][0] <= var]
                    region_2 = [j for j in vv if j not in region_1]

                    if len(region_1) > 0 and len(region_2) > 0:
                        av1 = sum([i[1] for i in region_1])/len(region_1)
                        av2 = sum([i[1] for i in region_2])/len(region_2)

                        f = form(region_1, av1, region_2, av2)
                        leaf_1 = {'points': region_1, 'avg': av1}
                        leaf_2 = {'points': region_2, 'avg': av2}
                        forms.append((leaf_1, leaf_2, ('x', var), f))

                for y in yy:
                    var = y
                    region_1 = [j for j in vv if j[0][1] <= var]
                    region_2 = [j for j in vv if j not in region_1]

                    if len(region_1) > 0 and len(region_2) > 0:
                        av1 = sum([i[1] for i in region_1])/len(region_1)
                        av2 = sum([i[1] for i in region_2])/len(region_2)

                        f = form(region_1, av1, region_2, av2)
                        leaf_1 = {'points': region_1, 'avg': av1}
                        leaf_2 = {'points': region_2, 'avg': av2}
                        forms.append((leaf_1, leaf_2, ('y', var), f))

                sorted_f = sorted(forms, key=lambda x: x[3])
                best_split = sorted_f[0]
                leaf['split_point'] = best_split[2]
                leaf['left_node'] = best_split[0]
                leaf['right_node'] = best_split[1]

                print(leaf['split_point'])

                next_leafs.append(leaf['left_node'])
                next_leafs.append(leaf['right_node'])

                print("\n")

        leafs = next_leafs
        all_nodes.extend(leafs)
        if len(leafs) == 0:
            break