Scraping web content to generate a Kindle e-book

Ever since I bought a Kindle, I have been thinking about how to get the most out of it. Many books can be purchased, and there are also many free e-books, but a lot of interesting content still exists only as web pages. For example, O'Reilly Atlas offers many e-books but only for free online reading, and plenty of other material and documentation is available only on the web. So I wanted to convert this online material to EPUB or MOBI so that it can be read on a Kindle. This article describes how to do that with Calibre and a small amount of code.

Calibre

Introduction to Calibre

Calibre is a free e-book management tool that runs on Windows, OS X and Linux. Besides managing e-books, it can use recipe files (which are actually Python code) to fetch the content of specified web pages and generate e-books in formats such as MOBI. You customize the fetching behavior by writing recipes that fit different page structures.

Install Calibre

Calibre can be downloaded from http://calibre-ebook.com/download; choose the installer that matches your operating system.

On Linux, you can also install it from a package repository:

Arch Linux:

pacman -S calibre

Debian / Ubuntu:

apt-get install calibre

Red Hat / Fedora / CentOS:

yum -y install calibre

Note that on OS X, the command line tools need to be installed separately.

Scraping web pages to generate an e-book

The following uses Git Pocket Guide as an example to show how to generate an e-book from web pages with Calibre.

Find the index page

To scrape a book, the first step is to find its index page. This is usually the table of contents, where each entry links to the corresponding content page. The index page determines which pages are fetched and the order in which the content is organized. In this example, the index page is http://chimera.labs.oreilly.com/books/1230000000561/index.html.

Write the recipe

A recipe is a script with the .recipe extension. Its content is Python code that defines which pages Calibre fetches and how it processes them. The following is the recipe used to scrape Git Pocket Guide:

from calibre.web.feeds.recipes import BasicNewsRecipe

class Git_Pocket_Guide(BasicNewsRecipe):

    title = 'Git Pocket Guide'
    description = ''
    cover_url = 'http://akamaicovers.oreilly.com/images/0636920024972/lrg.jpg'

    url_prefix = 'http://chimera.labs.oreilly.com/books/1230000000561/'
    no_stylesheets = True
    keep_only_tags = [{ 'class': 'chapter' }]

    def get_title(self, link):
        # The chapter title is the text of the link.
        return link.contents[0].strip()

    def parse_index(self):
        soup = self.index_to_soup(self.url_prefix + 'index.html')

        div = soup.find('div', { 'class': 'toc' })

        articles = []
        for link in div.findAll('a'):
            # Skip links that point to an anchor inside a page.
            if '#' in link['href']:
                continue

            # Skip links that are not chapter pages.
            if 'ch' not in link['href']:
                continue

            til = self.get_title(link)
            url = self.url_prefix + link['href']
            a = { 'title': til, 'url': url }

            articles.append(a)

        ans = [('git_pocket_guide', articles)]

        return ans

The following sections explain the different parts of the code.

The overall structure

Overall, a recipe is a Python class, and this class must inherit from calibre.web.feeds.recipes.BasicNewsRecipe.

parse_index

The core method of a recipe is parse_index, which is also the only method a recipe must implement. Its job is to analyze the contents of the index page and return a somewhat complex data structure (introduced later) that defines the content of the whole e-book and the order in which it is organized.

Global property settings

At the beginning of the class, some global properties are defined:

title = 'Git Pocket Guide'
description = ''
cover_url = 'http://akamaicovers.oreilly.com/images/0636920024972/lrg.jpg'

url_prefix = 'http://chimera.labs.oreilly.com/books/1230000000561/'
no_stylesheets = True
keep_only_tags = [{ 'class': 'chapter' }]

title: the title of the e-book

description: the description of the e-book

cover_url: the cover image of the e-book

url_prefix: a property of my own, the prefix of the content pages, used later to assemble the full page URLs

no_stylesheets: do not use the pages' CSS styles

keep_only_tags: this line tells Calibre to consider only DOM elements whose class is "chapter" when analyzing the index page. If you look at the source of the index page, you will find that these correspond to the first-level headings. The reason is that, in this example, each first-level heading on the index page corresponds to a separate content page, while the second-level headings only link to an anchor within a page, so only the first-level headings need to be considered.

parse_index return value

This section describes the data structure that parse_index must return after analyzing the index page.

The return value as a whole is a list in which each element is a tuple, and each tuple represents one volume. There is only one volume in this example, so the list contains only one tuple.

Each tuple has two elements. The first is the volume name. The second is a list in which each element is a map representing one chapter. Each map has two entries, title and url: title is the chapter title and url is the URL of the page where the chapter is located. Calibre fetches and organizes the whole book according to the value returned by parse_index, and it also fetches and processes externally linked images in the content.
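As a rough illustration of the structure described above, the sketch below builds such a return value by hand and checks its shape. The chapter titles, the file names ch01.html and ch02.html, and the helper check_index are made up for illustration; they are not part of Calibre.

```python
# Hypothetical example of the value parse_index returns for a one-volume book.
URL_PREFIX = 'http://chimera.labs.oreilly.com/books/1230000000561/'

index = [
    (
        'Git Pocket Guide',  # volume name
        [
            # One dict per chapter; titles and file names are assumed.
            {'title': 'First chapter', 'url': URL_PREFIX + 'ch01.html'},
            {'title': 'Second chapter', 'url': URL_PREFIX + 'ch02.html'},
        ],
    ),
]


def check_index(index):
    """Return True if index is a list of (volume name, chapter list) tuples."""
    if not isinstance(index, list):
        return False
    for volume in index:
        if not (isinstance(volume, tuple) and len(volume) == 2):
            return False
        name, chapters = volume
        if not isinstance(name, str) or not isinstance(chapters, list):
            return False
        for chapter in chapters:
            # Every chapter must be a dict with at least 'title' and 'url'.
            if not isinstance(chapter, dict):
                return False
            if not {'title', 'url'} <= set(chapter):
                return False
    return True


print(check_index(index))
```

A check like this can be handy while developing a recipe, since a wrongly shaped return value is an easy mistake to make.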

The whole of parse_index uses soup to parse the index page and build the data structure described above.
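To try the same idea outside Calibre, here is a minimal sketch that uses the standard library's html.parser instead of Calibre's index_to_soup. The sample HTML, the class TocLinkParser and the function build_index are assumptions made for illustration; only the two link filters are taken from the recipe.

```python
from html.parser import HTMLParser

URL_PREFIX = 'http://chimera.labs.oreilly.com/books/1230000000561/'

# Made-up table of contents, shaped like the index page the article describes.
SAMPLE_TOC = """
<div class="toc">
  <a href="ch01.html"> First chapter </a>
  <a href="ch01.html#section-1">A section inside the first chapter</a>
  <a href="ch02.html">Second chapter</a>
  <a href="pr01.html">Preface</a>
</div>
"""


class TocLinkParser(HTMLParser):
    """Collect (href, text) pairs for links inside <div class="toc">."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._div_depth = 0   # greater than 0 while inside the toc div
        self._href = None     # href of the <a> currently being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'div':
            if self._div_depth or 'toc' in (attrs.get('class') or '').split():
                self._div_depth += 1
        elif tag == 'a' and self._div_depth and attrs.get('href'):
            self._href = attrs['href']
            self._text = []

    def handle_endtag(self, tag):
        if tag == 'div' and self._div_depth:
            self._div_depth -= 1
        elif tag == 'a' and self._href is not None:
            self.links.append((self._href, ''.join(self._text).strip()))
            self._href = None

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)


def build_index(html, volume_name):
    """Build the list of (volume name, chapters) tuples from a TOC page."""
    parser = TocLinkParser()
    parser.feed(html)
    articles = []
    for href, title in parser.links:
        # Same filters as the recipe: skip in-page anchors and non-chapter pages.
        if '#' in href or 'ch' not in href:
            continue
        articles.append({'title': title, 'url': URL_PREFIX + href})
    return [(volume_name, articles)]


index = build_index(SAMPLE_TOC, 'Git Pocket Guide')
for chapter in index[0][1]:
    print(chapter['title'], chapter['url'])
```

In a real recipe Calibre's soup does this work for you; the sketch is only meant to make the filtering and the shape of the result easy to experiment with.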

More

The above is the most basic kind of recipe. To learn about further usage, refer to the API documentation.

Generate the MOBI file

After writing the recipe, you can generate the e-book by running the following command on the command line:

ebook-convert git_pocket_guide.recipe git_pocket_guide.mobi

This generates an e-book in MOBI format. ebook-convert fetches the relevant content and organizes its structure according to the recipe code.
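If you convert several recipes, the command can also be driven from a script. The following is a small sketch under that assumption; the helper names build_command and convert are made up, and the optional --output-profile setting is not used in the original command.

```python
import shutil
import subprocess


def build_command(recipe_path, output_path, output_profile=None):
    """Return the argument list for one ebook-convert run."""
    command = ['ebook-convert', recipe_path, output_path]
    if output_profile:
        # Optional: choose an output profile such as 'kindle'.
        command += ['--output-profile', output_profile]
    return command


def convert(recipe_path, output_path, output_profile=None):
    """Run ebook-convert, failing early if it is not on the PATH."""
    if shutil.which('ebook-convert') is None:
        raise RuntimeError('ebook-convert not found; is Calibre installed?')
    command = build_command(recipe_path, output_path, output_profile)
    subprocess.run(command, check=True)


# Only print the command here; call convert() to actually run it.
print(build_command('git_pocket_guide.recipe', 'git_pocket_guide.mobi'))
```

Keeping the command construction separate from the execution makes it easy to inspect what will be run before any pages are fetched.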

Final result

Here is how the result looks on the Kindle.

[Screenshots: contents; content, page one; content, page two; page containing images; actual result]

My recipe repository

I created a repository called kindle-open-books on GitHub and put a number of recipes in it. Some of them I wrote myself, and others were contributed by other people. Contributions of recipes are welcome.