SmartReader
A .NET Standard library to extract the main content of a web page
SmartReader is designed to remove the clutter from a web page (ads, sidebars, and so on) and give you just the content. The core algorithm is a port of the Mozilla Readability library. The original library is stable and used in production inside Firefox. By relying on a library maintained by a competent organization like Mozilla, we can piggyback on their hard and well-tested work.
SmartReader also improves on the original library by extracting more and better metadata:
- site name
- the author and publication date
- the language
- the excerpt of the article
- the featured image
- a list of the images found (it can optionally also download them and store them as data URIs)
- an estimate of the time needed to read the article
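
Two items in the list above, storing images as data URIs and estimating reading time, may be easier to picture with a small sketch. The code below is not SmartReader's implementation and does not use its API. `ReaderSketch`, `ToDataUri`, `EstimateTimeToRead`, and the 200 words-per-minute default are all hypothetical names and values chosen for illustration only.

```csharp
using System;

// Illustrative only: these helpers are assumptions, not part of SmartReader.
public static class ReaderSketch
{
    // Embeds raw bytes in a "data:" URI so that an image can live inside
    // the HTML itself instead of being fetched from a separate URL.
    public static string ToDataUri(byte[] bytes, string mediaType) =>
        $"data:{mediaType};base64,{Convert.ToBase64String(bytes)}";

    // Naive estimate: count whitespace-separated words and divide by an
    // assumed reading speed. A real estimate could vary by language.
    public static TimeSpan EstimateTimeToRead(string text, int wordsPerMinute = 200)
    {
        // An empty separator array makes Split use whitespace as the delimiter.
        int words = text
            .Split(Array.Empty<char>(), StringSplitOptions.RemoveEmptyEntries)
            .Length;
        return TimeSpan.FromMinutes((double)words / wordsPerMinute);
    }
}
```

Under these assumptions, `ReaderSketch.ToDataUri(new byte[] { 1, 2, 3 }, "image/png")` yields `data:image/png;base64,AQID`, and a 400-word text gives an estimate of two minutes.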