
likelihood and log likelihood


References

https://stats.stackexchange.com/questions/289190/theoretical-motivation-for-using-log-likelihood-vs-likelihood
https://www.mathsisfun.com/algebra/logarithms.html
https://en.wikipedia.org/wiki/Maximum_likelihood_estimation
https://towardsdatascience.com/probability-concepts-explained-maximum-likelihood-estimation-c7b4342fdbb1
https://www.quora.com/What-is-the-advantage-of-using-the-log-likelihood-function-versus-the-likelihood-function-for-maximum-likelihood-estimation
https://machinelearningmastery.com/what-is-maximum-likelihood-estimation-in-machine-learning/
https://stats.stackexchange.com/questions/2641/what-is-the-difference-between-likelihood-and-probability
https://math.stackexchange.com/questions/892832/why-we-consider-log-likelihood-instead-of-likelihood-in-gaussian-distribution
https://towardsdatascience.com/whats-a-logarithm-cca50d031241
https://mathbitsnotebook.com/Algebra2/Exponential/EXExpMoreFunctions.html
https://towardsdatascience.com/log-loss-function-math-explained-5b83cd8d9c83


Maximum likelihood estimation and log-likelihood
  • MLE (Maximum Likelihood Estimation)

MLE, like pretty much every statistical approach, assumes that observations are independent, or at least conditionally independent. Thus, every likelihood can be written as a product:

\prod_{i=1}^{N} f\big(y_i, x_i, \Theta\big)
Here, f(y_i, x_i, \Theta) gives the probability of observing y_i given x_i, conditional on some parameter(s) \Theta, and we pick \Theta to maximize this likelihood. The exact shape of f may be quite involved, depending on how complicated your model is.

Since the log is a monotonic transformation, the argument that maximizes the log of a function is the same as the one that maximizes the original function. Thus, using the basic property that the log of a product is the sum of the logs, the log-likelihood becomes a sum:

\sum_{i=1}^{N} \log\Big(f\big(y_i, x_i, \Theta\big)\Big)

Since each term is separate, this is a lot easier to maximize. It is also more concave, since the logarithm is a concave function, which makes Newton-type methods of optimization work better. Numerical precision errors are also reduced: a product of many small probabilities underflows quickly, while a sum of logs does not. And if you're dealing with a simple model, it's a lot easier to take an analytical derivative and find a closed-form solution. (Taking the derivative of a sum is easy; taking the derivative of a lot of terms multiplied together gets messy!)
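
As a quick numerical illustration (a minimal NumPy sketch; the Bernoulli model and sample size are arbitrary choices, not taken from the sources above), multiplying many per-observation probabilities underflows double precision, while summing their logs stays well behaved:

import numpy as np

rng = np.random.default_rng(0)
theta_true = 0.3
y = rng.binomial(1, theta_true, size=2000)   # 2000 Bernoulli observations

theta = 0.3
per_obs = theta**y * (1 - theta)**(1 - y)    # f(y_i, theta) for each observation

likelihood = np.prod(per_obs)                # product of 2000 numbers in (0, 1)
log_likelihood = np.sum(np.log(per_obs))     # sum of their logs

print(likelihood)       # 0.0 -- the product underflows in double precision
print(log_likelihood)   # about -1200, perfectly representable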

Natural log has nice properties when combined with probability models from the exponential family, but you would still want to use a log-likelihood even if your probability model is not in the exponential family of distributions. The fact that it can help cancel some exponential terms is just a bonus.
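
As a concrete illustration of that bonus (a standard textbook derivation, not specific to the sources above), take the Gaussian density: the log cancels the exponential and leaves a simple quadratic term,

f\big(y_i \mid \mu, \sigma^2\big) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big(-\frac{(y_i - \mu)^2}{2\sigma^2}\Big)

\log f\big(y_i \mid \mu, \sigma^2\big) = -\frac{1}{2}\log\big(2\pi\sigma^2\big) - \frac{(y_i - \mu)^2}{2\sigma^2}

Summing the second line over i gives a log-likelihood whose derivative in \mu is linear, which is what yields the closed-form MLE \hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} y_i.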

  • Quasi-Newton method

Quasi-Newton methods are used to find either zeroes or local maxima and minima of functions, as an alternative to Newton's method. They can be used when the Jacobian or Hessian is unavailable or too expensive to compute at every iteration; the "full" Newton's method requires the Jacobian to search for zeroes, or the Hessian to find extrema.
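
A small sketch of how this is used for MLE in practice (assuming SciPy is available; the Bernoulli objective below is only illustrative): BFGS is a quasi-Newton method that builds an approximation to the Hessian from successive gradient evaluations instead of computing the true Hessian.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
y = rng.binomial(1, 0.3, size=500)   # observed coin flips

def neg_log_likelihood(params):
    # Unconstrained parameterization: theta = sigmoid(params[0]) stays in (0, 1)
    theta = 1.0 / (1.0 + np.exp(-params[0]))
    return -np.sum(y * np.log(theta) + (1 - y) * np.log(1 - theta))

# BFGS approximates the Hessian from gradient information; the gradient itself
# is estimated by finite differences here because jac is not supplied.
result = minimize(neg_log_likelihood, x0=np.array([0.0]), method="BFGS")
theta_hat = 1.0 / (1.0 + np.exp(-result.x[0]))
print(theta_hat)   # close to the sample mean of y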

Difference between likelihood and probability

https://tinyheero.github.io/2016/03/17/prob-distr.html
"Probability mass functions (pmf) are used to describe discrete probability distributions.
Probability density functions (pdf) are used to describe continuous probability distributions."
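
To make the distinction concrete, here is a minimal sketch using scipy.stats (the binomial and normal examples are illustrative choices, not part of the quoted source):

from scipy import stats

# pmf: discrete distribution -- probability of exactly 7 heads in 10 fair coin tosses
p_seven_heads = stats.binom.pmf(7, n=10, p=0.5)             # about 0.117

# pdf: continuous distribution -- density (not a probability) of N(0, 1) at x = 0.5
density_at_half = stats.norm.pdf(0.5, loc=0.0, scale=1.0)   # about 0.352

print(p_seven_heads, density_at_half)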

The answer depends on whether you are dealing with discrete or continuous random variables, so the two cases are treated separately below.

  • Discrete Random Variables

Suppose that you have a stochastic process that takes discrete values (e.g., outcomes of tossing a coin 10 times, number of customers who arrive at a store in 10 minutes, etc.). In such cases, we can calculate the probability of observing a particular set of outcomes by making suitable assumptions about the underlying stochastic process (e.g., probability of coin landing heads is p and that coin tosses are independent).

Denote the observed outcomes by O and the set of parameters that describe the stochastic process as θ. Thus, when we speak of probability we want to calculate P(O|θ). In other words, given specific values for θ, P(O|θ) is the probability that we would observe the outcomes represented by O.

However, when we model a real-life stochastic process, we often do not know θ. We simply observe O and the goal then is to arrive at an estimate for θ that would be a plausible choice given the observed outcomes O. We know that given a value of θ the probability of observing O is P(O|θ). Thus, a 'natural' estimation process is to choose that value of θ that would maximize the probability that we would actually observe O. In other words, we find the parameter values θ that maximize the following function:

L(θ|O)=P(O|θ)

L(θ|O) is called the likelihood function. Notice that by definition the likelihood function is conditioned on the observed O and that it is a function of the unknown parameters θ.
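
A minimal sketch of this estimation process (assuming a binomial coin-toss model for O, chosen only to match the example above): evaluate L(θ|O) = P(O|θ) over a grid of candidate θ values and keep the maximizer.

import numpy as np
from scipy import stats

heads, tosses = 7, 10                    # observed outcomes O
thetas = np.linspace(0.01, 0.99, 99)     # candidate parameter values

# L(theta | O) = P(O | theta) under a binomial model
likelihoods = stats.binom.pmf(heads, n=tosses, p=thetas)

theta_hat = thetas[np.argmax(likelihoods)]
print(theta_hat)   # about 0.7, matching the analytic MLE heads / tosses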

  • Continuous Random Variables

In the continuous case the situation is similar, with one important difference: we can no longer talk about the probability that we observed O given θ, because in the continuous case P(O|θ) = 0.

Denote the probability density function (pdf) associated with the outcomes O as f(O|θ). In the continuous case we therefore estimate θ, given the observed outcomes O, by maximizing the function

L(θ|O)=f(O|θ)

In this situation we cannot technically assert that we are finding the parameter value that maximizes the probability of observing O; instead, we maximize the pdf associated with the observed outcomes O.
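
A minimal sketch of the continuous case (assuming a Gaussian model for O; the data are simulated purely for illustration): maximizing the pdf-based likelihood recovers the familiar closed-form estimates.

import numpy as np

rng = np.random.default_rng(2)
O = rng.normal(loc=5.0, scale=2.0, size=1000)     # observed outcomes O

# Closed-form maximizers of L(theta | O) = prod_i f(O_i | mu, sigma) for the Gaussian
mu_hat = O.mean()                                 # MLE of the mean
sigma_hat = np.sqrt(((O - mu_hat) ** 2).mean())   # MLE of sigma (divides by N, not N - 1)

# Log of the likelihood at the estimates, using the Gaussian log-density
log_lik = np.sum(-0.5 * np.log(2 * np.pi * sigma_hat**2)
                 - (O - mu_hat) ** 2 / (2 * sigma_hat**2))
print(mu_hat, sigma_hat, log_lik)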
