letian - 90+

Upvoted by: DevHaufior amours jefflee yangyangyang

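Chapter 4 of 《机器学习系统设计》 (Building Machine Learning Systems with Python) by Willi Richert and Luis Pedro Coelho is devoted to LDA topic models. The chapter does not explain the underlying theory of LDA, but it does spend a whole section on "Choosing the number of topics". An excerpt: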

So far, we have used a fixed number of topics, which is 100. This was purely an
arbitrary number; we could have just as well done 20 or 200 topics. Fortunately,
for many users, this number does not really matter. If you are going to only use the
topics as an intermediate step as we did previously, the final behavior of the system
is rarely very sensitive to the exact number of topics. This means that as long as
you use enough topics, whether you use 100 topics or 200, the recommendations
that result from the process will not be very different. One hundred is often a good
number (while 20 is too few for a general collection of text documents). The same
is true of setting the alpha (α) value. While playing around with it can change the
topics, the final results are again robust against this change.
If you are going to explore the topics yourself or build a visualization tool, you
should probably try a few values and see which gives you the most useful or most
appealing results.
However, there are a few methods that will automatically determine the number of
topics for you depending on the dataset. One popular model is called the hierarchical
Dirichlet process. Again, the full mathematical model behind it is complex and beyond
the scope of this book, but the fable we can tell is that instead of having the topics be
fixed a priori and our task being to reverse engineer the data to get them back, the
topics themselves were generated along with the data. Whenever the writer was going
to start a new document, he had the option of using the topics that already existed or
creating a completely new one.
This means that the more documents we have, the more topics we will end up with.
This is one of those statements that is unintuitive at first, but makes perfect sense upon
reflection. We are learning topics, and the more examples we have, the more we can
break them up. If we only have a few examples of news articles, then sports will be a
topic. However, as we have more, we start to break it up into the individual modalities
such as Hockey, Soccer, and so on. As we have even more data, we can start to tell
nuances apart: articles about individual teams and even individual players. The same
is true for people. In a group of many different backgrounds, with a few "computer
people", you might put them together; in a slightly larger group, you would have
separate gatherings for programmers and systems managers. In the real world, we
even have different gatherings for Python and Ruby programmers.
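
As a side illustration (not from the book), the "reuse an existing topic or create a new one" fable can be sketched with a Chinese-restaurant-process style simulation. This is a deliberately simplified toy, not the full hierarchical Dirichlet process: it assumes each document belongs to exactly one topic, and the function name `simulate_topic_counts` and the `concentration` parameter are hypothetical names chosen for this sketch.

```python
import random


def simulate_topic_counts(num_documents, concentration=1.0, seed=0):
    """Toy simulation of the 'reuse a topic or create a new one' story.

    Each new document joins an existing topic with probability proportional
    to the number of documents already in that topic, or opens a brand-new
    topic with probability proportional to `concentration`.
    Returns a list with the number of documents in each topic.
    """
    rng = random.Random(seed)
    topic_sizes = []
    for n in range(num_documents):
        # n documents have been placed so far, so the existing topics carry
        # a total weight of n and the "new topic" option carries `concentration`.
        r = rng.uniform(0, n + concentration)
        cumulative = 0.0
        for k, size in enumerate(topic_sizes):
            cumulative += size
            if r < cumulative:
                topic_sizes[k] += 1  # reuse an existing topic
                break
        else:
            topic_sizes.append(1)  # no existing topic chosen: create a new one
    return topic_sizes


# Print how many topics appear as the collection grows.
for n in (10, 100, 1000, 10000):
    print(n, "documents ->", len(simulate_topic_counts(n)), "topics")
```

With a fixed seed, a longer run replays the shorter run and then continues, so the number of topics can only stay the same or grow as documents are added, which is the behavior the excerpt describes. Under this toy model the growth is slow, so a much larger collection adds only a handful of extra topics.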
One of the methods for automatically determining the number of topics is called
the **hierarchical Dirichlet process (HDP)**