You can fit an ordinary least squares model from scikit-learn to the data you gave by first taking the log of both arrays.
import numpy as np
import pymc3 as pm
import pandas as pd
import theano.tensor as tt
from sklearn.linear_model import LinearRegression
import statsmodels.formula.api as smf
train = np.log(np.array([11.52, 11.559, 12.31, 16.46, 11.84, 7.38, 9.99, 16.72, 11.617,
11.77, 6.48, 9.035, 12.87, 11.18, 6.75]))
test = np.log(np.array([25.51658407, 24.61306145, 19.4007494, 24.85111923,
25.99397106, 14.30284824, 17.69451713, 27.37460301,
22.23326366, 18.44905152, 10.28001306, 10.68681843,
28.85399089, 14.02840557, 18.41941787]))
reg_model = LinearRegression().fit(train.reshape(-1, 1), test.reshape(-1, 1))
print('Af estimate: ', np.exp(reg_model.intercept_))
This yields the following Af estimate:
Af estimate: [2.4844087]
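If you also want to check how well this log-log line describes the data, you can ask scikit-learn for the R² and the fitted slope directly; this is just a small addition to the snippet above, not part of the original fit:
# coefficient of determination (R^2) of the log-log fit
print('R^2: ', reg_model.score(train.reshape(-1, 1), test.reshape(-1, 1)))
# fitted slope, for comparison with the statsmodels coefficient below
print('slope: ', reg_model.coef_)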
Since you don't seem to be interested in predicting new data with the model, you might instead use statsmodels, which is geared towards estimating the linear model parameters and reporting their uncertainties.
result = smf.ols('test ~ train + 1', data=pd.DataFrame({'test':test,'train':train})).fit()
print('Statsmodels Af estimate: ', np.exp(result.params['train']))  # exp of the fitted coefficient
This yields 2.366, which is very similar to the figure above, and the R² matches the one you mentioned.
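If you also want the uncertainty that ordinary least squares itself reports on these parameters, statsmodels already provides standard errors and confidence intervals, for example:
# full regression report: coefficients, standard errors, R^2, etc.
print(result.summary())
# 95% confidence intervals for the parameters, exponentiated back to the original scale
print(np.exp(result.conf_int()))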
Finally, my recommendation is to use pymc3 to obtain a full Bayesian fit, which lets you estimate the uncertainty of the quantity you want to measure in a natural way. pymc3 has a steep learning curve, but it is an excellent library for probabilistic programming: when fitting a model, it gives you the whole posterior over your parameter space, which is usually what you are really interested in. The following is an example of a solution to your problem:
with pm.Model() as model:
    # Priors
    alpha = pm.Normal('alpha', mu=1.35, sd=5)    # centered around the literature value
    beta = pm.HalfNormal('beta', sd=10)          # only positive values, as it goes into the sqrt. Also, is height always positive here?
    sigma = pm.HalfNormal('sigma', sd=1)         # observation noise, estimated from the data
    beta2 = pm.Deterministic('beta2', tt.sqrt(beta * 9.81))   # g is very well known
    alpha_f = pm.Deterministic('alpha_f', tt.exp(alpha))      # directly track the output value we want

    # Likelihood
    likelihood = pm.Normal('y', mu=alpha + beta2 * train, sigma=sigma, observed=test)

    # Sampling
    trace = pm.sample(init='adapt_diag')
print(pm.summary(trace))
mean sd hpd_3% hpd_97% ... ess_sd ess_bulk ess_tail r_hat
alpha 0.781 0.544 -0.232 1.864 ... 309.0 440.0 406.0 1.01
beta 0.091 0.044 0.013 0.167 ... 517.0 438.0 359.0 1.01
sigma 0.259 0.056 0.172 0.368 ... 530.0 479.0 147.0 1.00
beta2 0.917 0.229 0.439 1.316 ... 434.0 438.0 359.0 1.01
alpha_f 2.535 1.552 0.465 5.224 ... 317.0 440.0 406.0 1.01
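If you prefer a single credible interval or a quick plot of the posterior for alpha_f rather than the full summary table, something along these lines should work with the trace from above (the 3%/97% bounds are just an example choice matching the hpd columns):
# posterior samples of alpha_f collected during sampling
alpha_f_samples = trace['alpha_f']
# e.g. a 94% interval, comparable to the hpd_3%/hpd_97% columns above
print(np.percentile(alpha_f_samples, [3, 97]))
# visualize the posterior distribution of alpha_f
pm.plot_posterior(trace, var_names=['alpha_f'])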
As you can see, there is a great deal of uncertainty in Af.
However, it is important to be careful about the input data and not to over-interpret the results. At the moment you don't supply any uncertainty on y or X (nor a covariance matrix), yet it is quite rare to have perfect knowledge of these values, so it is prudent to factor those uncertainties into your calculations. pymc3 makes it possible to do so in a natural way. My implementation estimates the noise level from the data itself (through sigma), but you may have your own measurement-device-based uncertainties instead.
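As a minimal sketch of what I mean, suppose you had per-point measurement uncertainties on y; here y_err is a hypothetical array standing in for whatever your instrument reports. You could then feed it directly into the likelihood instead of inferring a single sigma from the data, keeping the rest of the model as above:
# hypothetical per-point measurement uncertainties for the observed values
y_err = 0.05 * np.ones_like(test)  # placeholder: replace with your instrument's uncertainties

with pm.Model() as model_meas:
    alpha = pm.Normal('alpha', mu=1.35, sd=5)
    beta = pm.HalfNormal('beta', sd=10)
    beta2 = pm.Deterministic('beta2', tt.sqrt(beta * 9.81))
    alpha_f = pm.Deterministic('alpha_f', tt.exp(alpha))
    # known measurement errors enter directly as the observation noise
    likelihood = pm.Normal('y', mu=alpha + beta2 * train, sigma=y_err, observed=test)
    trace_meas = pm.sample(init='adapt_diag')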