How does iSiloX build its files?


AWS
01-05-2004, 07:09 AM
This is a general question concerning the creation of an iSilo document...

I know that specifying a link depth greater than 2 or 3 can create a large document, but I have had some success in capturing complete web sites by specifying that offsite links not be followed. However, I have been working on one particular site with a link depth of 6 and no offsite links included. Now, after more than 12 hours of "Retrieving Resources", I am wondering how much longer this is going to take and how large the file will become.

My general question is this:

When creating an iSilo document from a site with the following structure...

index.htm
|
-products.htm
|
-support.htm
|
-contact.htm
|
-images folder


If there is a common image (such as a logo) used by all four pages and the image file resides in the images folder, does iSilo import a copy of the image 4 times (once for each page captured), or is it intelligent enough to copy the image once and reference it in each of the 4 pages?


More importantly, if iSilo is set for a link depth of 2, and each page in the above structure has a sidebar menu that links to all pages in the site, does iSilo import duplicate copies of each page for each link in all 2nd level sidebars?

Another twist on this scenario is how iSilo handles the parsing of pages with "bread crumbs": the line of links at the top of many sites' pages that shows the path taken to the current page, such as:

iSilo forum > Support > iSiloX and iSiloXC


In such a case where there are 10 pages under "iSiloX and iSiloXC" and each has a "Support" link on it, does iSiloX link to the "Support" page and import it 10 times or only once?

I ask these questions because in watching iSiloX take over 12 hours "Retrieving Resources" for my current project (which uses such a design) I am sure I have seen the same URL being retrieved many times.

Does iSiloX retrieve the multiple references and then sort and clean up the repetitive pages? Or will the resulting file grow exponentially because there are dozens, if not hundreds, of identical pages compressed into the final file?

AWS

iSilo
01-05-2004, 11:57 AM
The simple answer is that iSiloX/iSiloXC retrieves only one copy of the content of each unique URL.

To answer each of your examples directly:

If there is a common image (such as a logo) used by all four pages and the image file resides in the images folder, iSiloX imports a copy of the image once only and has each of the pages reference the same image.
If iSiloX is set for a link depth of two, and each page in the structure has a sidebar menu that links to all pages in the site, iSiloX retrieves each page only once even though every page's second level sidebar references all other pages.
In a case where there are 10 pages under "iSiloX and iSiloXC" and each has a "Support" link on it, iSiloX links to the "Support" page and imports it only once.
iSiloX/iSiloXC determine the uniqueness of a URL by performing a case-sensitive comparison of the URL with all previously retrieved URLs. If the content of a URL has been retrieved previously for the current document being converted, the content is not retrieved again. A reference is made to the already imported content, so only one copy of it will be present in the converted document.
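
To illustrate the rule above, here is a minimal Python sketch of a depth-limited crawl that keeps a set of already-seen URLs. It is not iSiloX's actual code: the `SITE` dictionary, the `crawl` function, the breadth-first order, and the choice to count the start page as depth 0 are all assumptions made for the example. The only part taken from the explanation above is that a URL is skipped when an exact, case-sensitive match has already been retrieved.

```python
from collections import deque

# Hypothetical site modelled on the question: four pages, each with a
# sidebar linking to every other page, and all of them using one logo.
PAGES = ["/index.htm", "/products.htm", "/support.htm", "/contact.htm"]
LOGO = "/images/logo.gif"
SITE = {page: [p for p in PAGES if p != page] + [LOGO] for page in PAGES}
SITE[LOGO] = []  # an image has no outgoing links


def crawl(start, max_depth):
    """Walk the link graph breadth-first, fetching each unique URL once."""
    retrieved = []              # order in which URLs are "downloaded"
    seen = {start}              # exact, case-sensitive URL strings
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        retrieved.append(url)   # stands in for the real network request
        if depth == max_depth:
            continue            # do not follow links past the depth limit
        for link in SITE.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return retrieved


pages = crawl("/index.htm", max_depth=2)
total_links = sum(len(links) for links in SITE.values())
print(len(pages), "URLs retrieved for", total_links, "link references")
```

In this toy site there are 16 link references in total, but only 5 distinct URLs, so only 5 retrievals happen and the logo is stored once. The number of references grows quickly with link depth, but the number of retrievals is bounded by the number of unique URLs.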

Note that iSiloX/iSiloXC retrieve content sequentially and not in parallel, which may explain the apparent slowness.

A possible reason why you may be seeing what appear to be duplicate retrievals is that the site you are converting may use URLs that look the same but in fact differ. One cause is the use of different letter case in URLs. On a Windows webserver, file names are not case-sensitive, so if links to a particular page from different locations differ only in case, the webserver serves the same page either way, but iSiloX treats each spelling as a unique URL. On a UNIX-based webserver, file names are case-sensitive, so URLs that differ in case refer to different files. Another cause of apparently duplicate retrievals is a webserver that appends a unique string to the URL so that each retrieval is unique.
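
As a rough way to check for the case problem, the sketch below groups a list of URLs by their lowercased form and reports any group that contains more than one spelling. The `case_variants` helper and the example.com URLs are made up for illustration; how the list of retrieved URLs is collected is left out.

```python
from collections import defaultdict


def case_variants(urls):
    """Return groups of URLs that are identical except for letter case."""
    groups = defaultdict(set)
    for url in urls:
        groups[url.lower()].add(url)
    # Keep only the groups where more than one distinct spelling occurs.
    return {
        key: sorted(spellings)
        for key, spellings in groups.items()
        if len(spellings) > 1
    }


# Hypothetical list of retrieved URLs.
urls = [
    "http://www.example.com/Products.htm",
    "http://www.example.com/products.htm",
    "http://www.example.com/support.htm",
]
variants = case_variants(urls)
for key, spellings in variants.items():
    print(key, "was requested as:", ", ".join(spellings))
```

Any group that shows up here would count as several unique URLs under a case-sensitive comparison, even if a Windows webserver returns the same file for all of them.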

If you really want to verify what iSiloX is retrieving, you can enable the connection log and see exactly what is being retrieved. To enable the connection log in iSiloX, do this:

In the Tools menu, click Options. This gives you the Options dialog.
In the Options dialog, click the Logging tab.
On the Logging tab, check Log connection messages to a file and enter the full path to the log file (e.g., c:\connect.log).
Click OK to accept the changes.
After enabling connection logging, the next time you perform a conversion, all web server connections will be logged to that connection log file.

AWS
01-05-2004, 12:38 PM
Thanks for the prompt reply.

Based on your explanation, I understand that if there are links in a site that refer to the same page but (because of URL parameters, named anchors, etc.) are unique in their form, then iSilo will keep a unique copy of that page.

In other words, if there is a sidebar menu on a page that has multiple links to another single page, but each link uses a named anchor, then iSilo will regard each of those links it is to follow as unique and include multiple copies of a single page in the final document.

Example: if there is a main page that has a sidebar with five links to the same products page, but each link uses a named anchor to take the viewer to a particular place on the page, then iSilo would think the single page was in fact 5 different pages and include 5 copies of the page in its final structure.

/products.htm#item1
/products.htm#item2
/products.htm#item3
/products.htm#item4
/products.htm#item5

Is this a correct understanding?

iSilo
01-05-2004, 02:11 PM
Sorry about not being clear about the anchor part. The anchor is not included in the URL comparison.

So taking your example of five links as follows:

/products.htm#item1
/products.htm#item2
/products.htm#item3
/products.htm#item4
/products.htm#item5

The URL of each of the above is /products.htm. The anchor parts (e.g., #item1, #item2, #item3, #item4, #item5) are not considered part of the URL for comparison purposes. So only one copy of /products.htm is retrieved and included.

AWS
01-05-2004, 02:36 PM
That was a pretty quick reply... Thanks!

So anchors are ignored in the URL comparison but any URLs with parameters would be filtered for uniqueness (as they should be since different parameters would produce a different page).

eg:
www.mysite.com/myapp.asp?Fname=Frodo?Lname=Baggins
and
www.mysite.com/myapp.asp?Fname=Samwise?Lname=Gamgee

Would be treated as unique pages.

Thanks,

AWS

iSilo
01-05-2004, 02:55 PM
Yes, that is correct. The query/parameter part is included in the URL comparison, but the anchor is not.

So yes, each of the following examples you give is unique:

http://www.mysite.com/myapp.asp?Fname=Frodo?Lname=Baggins
http://www.mysite.com/myapp.asp?Fname=Samwise?Lname=Gamgee
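
For anyone who wants to predict which links will count as the same URL, the comparison described in this thread can be approximated in Python as below. This is only a sketch of the stated rules (anchor dropped, query kept, case-sensitive match), not iSiloX's implementation; `comparison_key` is a made-up helper name, and `urldefrag` comes from Python's standard `urllib.parse` module.

```python
from urllib.parse import urldefrag


def comparison_key(url):
    """Drop the #anchor; keep the rest, including the query, unchanged."""
    return urldefrag(url).url


# The five anchor links from the thread all reduce to one URL.
anchors = ["/products.htm#item%d" % n for n in range(1, 6)]
anchor_keys = {comparison_key(u) for u in anchors}

# The two query URLs from the thread stay distinct.
queries = [
    "http://www.mysite.com/myapp.asp?Fname=Frodo?Lname=Baggins",
    "http://www.mysite.com/myapp.asp?Fname=Samwise?Lname=Gamgee",
]
query_keys = {comparison_key(u) for u in queries}

print(len(anchor_keys), "unique URL among the anchor links")
print(len(query_keys), "unique URLs among the query links")
```

Because the key is never lowercased, /Products.htm and /products.htm would still produce two different keys, which matches the case-sensitive behaviour described earlier in the thread.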