Alternative story download format

Hello everyone,

I love reading the stories posted on this site, but unfortunately my e-reader does not get along well with plain HTML documents, and copying them into an RTF or plain-text file by hand would be a big pain in the butt and would probably butcher the formatting. I was thinking of creating something for my personal use that converts stories (given their URL) to an EPUB document, to make them easier to view on portable devices.

I could create a script for myself, but if there is wider interest I might be able to make something that others can easily use. I've got some ideas in mind, but before starting I guess it would be important to establish whether I could go about doing things in a way that avoids stepping on any toes, breaking any policy, or upsetting any authors. My biggest concern is respecting the authors and the site, so things like including links back to the original content and full author credits would be a no-brainer. For now I'm just looking for any feedback, comments, or ideas as to whether this would be useful and/or possible with regard to policies.

Cheers,
J

Possibilities

Your best bet would be to work on the printer-friendly versions of the pages, as they strip out the header, sidebars and comments. Some devices are capable of handling PDFs, and there are numerous free-to-use PDF printer drivers available on the 'net (unlike Acrobat Distiller!)

I think Piper may also have been working on such a script - it might be worth liaising with her to see what her ideas are on the subject.

There's also a chance such a script could work with our sister sites Stardust and Fictioneer, since they're also Drupal based.

 

Find me on Google+ | Examine EAFOAB Resources

There are 10 kinds of people in the world - those who understand binary and those who don't...

As the right side of the brain controls the left side of the body, then only left-handers are in their right mind!

Thanks for the advice

I had kind of written PDFs off due to their fixed page size and the general difficulties I've had using them on different devices. EPUB looks like it would just be a matter of taking the "content" div, dumping it into an XHTML document, adding the proper manifest files, and zipping it all up, which isn't really too difficult. The only issue is that I'm not 100% certain whether the stories are all XHTML or just HTML, though they look like XHTML at a glance (tags are lowercase and closed properly).
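The "XHTML plus manifest files, zipped up" idea can be sketched roughly as below. This is only a minimal EPUB 2 layout under my own assumptions, not anyone's actual script: the function name `build_epub`, the `chapNNNN.xhtml` file names, and the placeholder `book_id` are all made up for illustration, and the output has not been run through a validator. A credits page listing the original URLs could simply be passed in as one more chapter.

```python
import io
import zipfile
from xml.sax.saxutils import escape

# Fixed pointer file that tells the reader where the package (OPF) file lives.
CONTAINER_XML = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">'
    '<rootfiles><rootfile full-path="OEBPS/content.opf" '
    'media-type="application/oebps-package+xml"/></rootfiles></container>'
)


def build_epub(title, author, chapters,
               book_id="urn:uuid:00000000-0000-0000-0000-000000000000"):
    """Return the bytes of a minimal EPUB 2 file.

    chapters is a list of (chapter_title, xhtml_body) pairs. The body is
    inserted unescaped, so it must already be well-formed XHTML.
    book_id is a placeholder; a real book should get its own identifier.
    """
    uid = escape(book_id, {'"': "&quot;"})
    items, refs, nav, pages = [], [], [], {}
    for i, (chap_title, body) in enumerate(chapters, start=1):
        name = f"chap{i:04d}.xhtml"
        label = escape(chap_title)
        pages[name] = (
            '<?xml version="1.0" encoding="UTF-8"?>\n'
            '<html xmlns="http://www.w3.org/1999/xhtml">'
            f'<head><title>{label}</title></head><body>{body}</body></html>'
        )
        items.append(f'<item id="c{i}" href="{name}" media-type="application/xhtml+xml"/>')
        refs.append(f'<itemref idref="c{i}"/>')
        nav.append(
            f'<navPoint id="n{i}" playOrder="{i}"><navLabel><text>{label}</text>'
            f'</navLabel><content src="{name}"/></navPoint>'
        )
    # Package file: metadata, manifest (every file), spine (reading order).
    opf = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<package xmlns="http://www.idpf.org/2007/opf" version="2.0" '
        'unique-identifier="bookid"><metadata xmlns:dc="http://purl.org/dc/elements/1.1/">'
        f'<dc:title>{escape(title)}</dc:title><dc:creator>{escape(author)}</dc:creator>'
        f'<dc:language>en</dc:language><dc:identifier id="bookid">{uid}</dc:identifier>'
        '</metadata><manifest>'
        '<item id="ncx" href="toc.ncx" media-type="application/x-dtbncx+xml"/>'
        + "".join(items) + '</manifest><spine toc="ncx">' + "".join(refs)
        + '</spine></package>'
    )
    # NCX file: the table of contents the reader shows.
    ncx = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">'
        f'<head><meta name="dtb:uid" content="{uid}"/></head>'
        f'<docTitle><text>{escape(title)}</text></docTitle>'
        '<navMap>' + "".join(nav) + '</navMap></ncx>'
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        # The mimetype entry must come first and be stored uncompressed.
        zf.writestr("mimetype", "application/epub+zip", compress_type=zipfile.ZIP_STORED)
        zf.writestr("META-INF/container.xml", CONTAINER_XML)
        zf.writestr("OEBPS/content.opf", opf)
        zf.writestr("OEBPS/toc.ncx", ncx)
        for name, page in pages.items():
            zf.writestr("OEBPS/" + name, page)
    return buf.getvalue()
```

Writing the returned bytes to a file ending in `.epub` should give something an e-reader can at least attempt to open. Images and stylesheets are left out of this sketch.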

My current idea would end up being an HTML page that lets you enter one or more URLs. When the user clicks 'convert', it would POST the URLs to a PHP page that goes through them in order, scrapes the story pages, and compiles all the given stories into a single EPUB file (with a table of contents, an about page containing all the original URLs, etc.). It would also be cool if you could give it an organizer page and have it grab the links from that and add them all to a single EPUB.
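The "grab links from an organizer page" part might look something like the sketch below. It is in Python rather than PHP, and it assumes the story links are ordinary `<a href>` anchors on the same host as the organizer page; I have not checked how the site's organizer pages are really structured. The names `LinkCollector` and `story_links` are invented here. It works on HTML text you already have and does no downloading itself.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect same-host links, in page order, without duplicates."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.host = urlparse(base_url).netloc
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the organizer page's own URL.
        absolute = urljoin(self.base_url, href)
        if urlparse(absolute).netloc == self.host and absolute not in self.links:
            self.links.append(absolute)


def story_links(organizer_html, base_url):
    """Return the same-host links found in organizer_html."""
    collector = LinkCollector(base_url)
    collector.feed(organizer_html)
    return collector.links
```

A real version would still need to filter out navigation links and keep only the actual chapter pages, which depends on the site's markup.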

If I'm just throwing it together for myself it would probably be easier to toss some Python together, but I'd really like to give back to the community (if there is interest) after having the privilege of reading all these great stories :D

I'll be quiet now so the non-computery people who attempted to read this can have a break and try to keep their brain from exiting via their ears.

File sizes

Go through a story and roll all the chapters up into one convenient EPUB file?

I assume you never want to read 'Bike', then.

What you suggest will probably work for most of the stories posted here, but there are a number where merging into a single file really might not be a good idea.

Consider allowing several files, 'books' if you will, which will permit the handling of much more manageable sized files.

Not everyone has 100Mb Broadband connection, or wants to scroll through a 200MB file to find their current bookmark.
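Splitting a long story into several 'books' could be sketched like this. It is only an illustration under assumed inputs: chapters are taken to be (title, body) pairs, and the function name `split_into_volumes` and both limits are made up rather than taken from any existing script.

```python
def split_into_volumes(chapters, max_bytes=None, max_chapters=None):
    """Group (title, body) chapters into volumes, preserving order.

    A new volume starts when adding the next chapter would exceed
    max_bytes (measured on the UTF-8 encoded body) or when the current
    volume already holds max_chapters chapters. A single chapter larger
    than max_bytes still gets a volume of its own.
    """
    volumes, current, size = [], [], 0
    for title, body in chapters:
        chapter_size = len(body.encode("utf-8"))
        too_big = (max_bytes is not None and bool(current)
                   and size + chapter_size > max_bytes)
        too_many = max_chapters is not None and len(current) >= max_chapters
        if too_big or too_many:
            volumes.append(current)
            current, size = [], 0
        current.append((title, body))
        size += chapter_size
    if current:
        volumes.append(current)
    return volumes
```

Each returned volume could then be turned into its own EPUB file, so nobody has to download or scroll through one enormous book.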

Penny

Good idea

Not only that, but a script making thousands of page requests in a short period of time might look like a DoS attack, and getting blocked would be bad. I guess being able to force EPUB files into volumes of limited size would make a lot of sense. Thanks :D

This has been done

Piper has created a script to do this and we will be making that available, soon. :)

Hugs,
Erin

= Give everyone the benefit of the doubt because certainty is a fragile thing that can be shattered by one overlooked fact.

Piper has been working on

Piper has been working on one, I've seen it work. It seems to still have some issues however....

Samantha

Is piper's solution exactly identical?

In theory to what Joe-Q has proposed? Because I've been thinking about writing a localized program that runs directly on your own computer and uses the libcurl library to scrape the pages, grab all the latest stuff since last scrape, and export them to whatever you like.

I was thinking about it for those of us who have limited time at our computers and could leave this program to scrape for us, say, twice a day, and then review the results when we get a chance. This is because the site keeps no history beyond the latest 100 of anything.

Could probably even export to text-to-speech, though I'd be leery of what sort of quality any free libraries I might be able to get to do so would output.

The only problem is time... I really don't have time to write it.

Abigail Drew.

Is Calibre a solution?

Visit http://calibre-ebook.com/

This is free software that will convert assorted source files into e-reader compatible files. I've used it to convert both HTML and rich text format into MOBI (I use a Kindle). Calibre will also convert into EPUB.

Alternative story download format

To create your own EPUB, the Sigil Editor (http://code.google.com/p/sigil/) is probably the way to go.
It will import HTML and split chapters into separate files.

Sigil is a full WYSIWYG editor. Very useful.

I, personally, use XML with an XSL transform. I started with XML2LIT back in 2001 and just ran with it. I'm still tweaking the output HTML for the EPUB-to-Kindle conversion. I can now generate LIT, HTML, and TXT (these convert from the xml2lit); PRC with an old version of a third-party program from Mobipocket (MobiPocketPublisher_Personal_US.exe); an EPUB version with the use of Sigil; and finally a PDF with help from Word.

My script, will run locally

My script will run locally on the server, so it won't show up as a DoS/DDoS or anything like that. It currently makes "chapterized" EPUB books of multi-part stories and individual chapters.

I've also been working on extending the script to make MobiPocket/Kindle compatible files, but currently the Kindle support is VERY beta code and will put all the chapters in a single file with no table of contents and no images (the EPUB books all have a TOC that allows you to skip through, as well as proper images where inserted by the author).

I make heavy use of curl/regexp and also lib-tidy and HTMLPurifier to make the HTML more presentable and compatible with ePub's XHTML restrictions. I plan to have the code fully integrated inside BigCloset as a "download ebook" link when Erin and I finish the BigCloset upgrade. The goal is to allow authors to turn off this feature if they wish, in case they choose to sell their own ebook copies via Amazon, Barnes and Noble, LuLu, DopplerPress, etc., and eventually to develop it into a proper Drupal module that draws everything from the database instead of via the printer-friendly pages.
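To illustrate why a cleanup pass is needed before content goes into an EPUB, here is a small well-formedness check. It is not lib-tidy or HTMLPurifier and does no repairing; it is just a Python standard-library XML parse, and the function name `is_well_formed` is made up for this sketch. It only reports whether a fragment would survive a strict XML parser.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment):
    """Return True if an HTML fragment parses as strict XML.

    The fragment is wrapped in a <div> so that several sibling elements
    are accepted. Unclosed tags and HTML-only named entities such as
    &nbsp; make the parse fail, which is the kind of thing a tidy step
    has to fix before the content can go into an EPUB.
    """
    try:
        ET.fromstring(f"<div>{fragment}</div>")
    except ET.ParseError:
        return False
    return True
```

For example, `<p>hi<br/></p>` passes, while `<p>hi<br></p>` and `<p>a&nbsp;b</p>` do not.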

There is a Beta version of my script usable via a Chrome/IE/FireFox/Safari "Bookmarklet" that several users will be allowed access to shortly for extensive testing.

Just as a note, BigCloset is unable to deliver the printer friendly version of Bike in a single document. We have limits on both memory resources and processing time in place that will cause the script to error out, and so please don't even try.

Whilst we can understand everyone's wish to always have the latest stories in their favorite format, we ask that users don't usurp the authors' ability to deny various output formats, and also that you don't run any scheduled cron scrapes of the site. It may not seem like much to you, and your script may only run every 12 hours, but some people get overzealous, and in an active community where there can easily be over 600 users online at any given time, robots and scripted services are the first things we will deny access to at heavy load.

-Piper


"Science is just magic with an explanation, and bumblebees are just tiny little fairies in disguise. :)" Submitted by Erin on Sun, 2010/04/04 - 6:37pm.



"She was like a butterfly, full of color and vibrancy when she chose to open her wings, yet hardly visible when she closed them."
— Geraldine Brooks


I've split Bike into several

I've split Bike into several eBooks, each of 400 chapters.
It is just too big to do much else.

Try to get to chapter 734 with just one book.

Still waiting for the Archive to update so I can get to chapters 1451 and onwards.

If anyone wants to host the books, let me know.

As to the Kindle format: it is a seriously cut-down version of HTML.
No bottom or right margins. You can only use fonts installed on the Kindle. No in-line styles, and a few other niggly formatting issues. A real pain to transform. Thank the Deities that Sigil has a built-in htmlTidy that can be turned on and off.

don't run scheduled scrapes...

I wouldn't do a full scrape, that's unnecessary and insane. I'd have my program keep track of the last time it checked, and then start with downloading only the pages listing the "latest" of each, comments, blogroll, and stories. It'd then only grab the pages since the -last- time it checked, and merely compile them as a list of links. When the user got home, they could then choose what to do with those links.

In effect, it'd be robotically doing what I do as often as I can anyway. Some "real" users probably do it even more often than 12 hours.

It probably wouldn't be too much of a stretch to check against the list and avoid grabbing the same page again when it only shows up in the comment roll because of a different comment.
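The bookkeeping described above could be sketched like this. It is a hypothetical helper that only compares lists of links already in hand and does no downloading or scheduling. It assumes that comment links differ from their page only by a `#fragment`, which I have not verified against the site.

```python
from urllib.parse import urldefrag


def unseen_links(latest_links, seen):
    """Return pages in latest_links that are not already in seen.

    Fragments (the part after '#') are stripped first, so two different
    comments on the same page count as one page. Order is preserved and
    each page appears at most once in the result.
    """
    known = {urldefrag(link).url for link in seen}
    fresh = []
    for link in latest_links:
        page = urldefrag(link).url
        if page not in known:
            known.add(page)
            fresh.append(page)
    return fresh
```

The `seen` list would have to be saved between runs (for instance in a small text file) so that the next check only reports what is new.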

I just hate missing anything. Provide me with a much longer history and I wouldn't even want to do something like this.

Abigail Drew.
