Rant IT: 2006

Thursday, March 30, 2006

IE and <pre> tags in the blog text

A couple of stories ago I mentioned that this blog looks broken when viewed in IE. I thought 'float' style, used in the blog template was a culprit. Turned out that it was due to <pre> tags I freely scattered around in text of some blog entries, featuring patches and code samples. Jist another thing to watch for when using blogger.

Links to the site and front page redirects

A quick observation, it seems that Google might have certain issues counting links to your site if it redirects from its main page "inland" of the site, i.e. for instance:


http://my.site.com |-(HTTP_MOVED_PERMANENTLY)-> http://my.site.com/news

Meanwhile, other sites, linking to yours are using base URL, i.e. http://my.site.com. Looks like a sure way to get zero page rank on the front page.

Monday, February 27, 2006

Can't click links in Google Adsense?

When it comes to earning a buck from online advertisement on your site, almost everyone turns to Google Adsense. It's nice, it's non intrusive (for the most part), and better yet - it's free. It's no sweat to put it on your site, it's customizable and blends in...

However, even if it looks good, it does not mean you'll start getting clicks right away, even if your visitors are willing. Why? There are certain peculiarities with links on a top of alpha transparent backgrounds in IE. Sometimes such links become non-clickable, the behavior described by Holger Rüprich and Justin Koivisto. There is a workaroud also suggested, but it works for links, which are already on the page. The way Adsense works is dynamic, links are produced by means of javascript. So the workaround does not work on them .

The solution would be to script a fix using DOM: pull elements of a particular type (A) and then apply a position style. I am working on this fix now. Once done, I'll publish it here. So stay tuned!

Saturday, February 25, 2006

Broken sidebar...

I noticed that this blog (at least in this particular theme) looks kind of wierd in some versions of IE. Side bar does not appear where it supposed to be, instead it shows up below the text area. This is, perhaps, due to some perks in IE regarding "float" style. Blogger could also be IE frendlier a bit... I'll fix this later.

Thursday, February 23, 2006

Search engines crawling agility: one site's ongoing experience

DISCLAMER: This report does not claim any particular accuracy. Your experience may vary or be altogether completely different.

Intro

My wife is an artist. It is a pretty tough place to be in if you do not know someone close to the high elves of art establishment. But the Internet is a perfect "equalizer" these days, almost like some crafts by Samuel Colt were in 19th century, right? Just get a website going and herds of eager visitors will eventually come pouring in. Before such notoriety comes, however, one should get listed with the search engines.

Our Experiences

My wife's been going through get-search-engines-to-know-your-website exercise since Feb 8th. At that pretty much the same time we submitted her site's URL to all engines we could find that provided such service: MSN, Yahoo, Gigablast, and Google. Next is a brief account of how fast and in which way search engines responded.

MSN

The first to come crawling were MSN and Gigablast - very same day. MSN quickly sniffed a few pages and departed. We were quite surprised to see that the site showed up in MSN search next day after MSN bot crawled the site. However, MSN postponed a more thorough site examination until later. It came a few days after initial crawl and scanned pages down to a third level (front page being 1st, pages referenced from front page 2nd, etc). Search results again, were updated next day following the crawl. Since then MSN bot has been quite regular visitor: every 3-4 days.

Gigablast

Gigablast's bot showed up the same day we submitted the site to it. Gigablast seems to have different agents to discover pages and to crawl them: there's a "bot" and there's a "spider" - which one does what - I don't really know. Gigablast have been very polite and would hit the site at the rate of one hit per about five seconds. That being also a factor in overall indexing speed. Being that polite takes a considerable time to crawl populated sites. It's still not quite out of my wife's website yet, and her's is not particularly populated. So site does not show up in Gigablast search just yet...

Yahoo

Yahoo was next with its slurp - a day after site submission. It glanced at a few pages briefly and left. However, site did not show up in search results until a few days later. Yahoo, however, has yet to dive beyond the depths of the first level. It shows up occasionally and scans pages it already visited, as if to make sure they are still there... It seems Yahoo applies a similar strategy to crawling as MSN - a brief, cursory scan, followed some time later by more inquisitive inspection of one or two levels deeper...

Google

The star of the Internet search came last - as frequently stars do - with a rigorous scan from the start, similar to that of Gigablast, but hitting more frequently. Around next day site showed in the Google search, however just the URL and nothing else. Even that would disappear every now and then. It took about 4-5 days al together to finish site crawl and about another five for more meaningful site links to appear in the search. It's kind of strange with Google, one day your site shows up in multitude of URLs, the other nothing comes up at all.

Conclusion

MSN showed so far the quickest crawl->update index turnaround. Yahoo is next. Then Google in a little bit strange way. Gigablast is not quite done yet.

To be continued

Friday, February 17, 2006

Fix: Time::Piece::strftime() causes coredump on OpenBSD 3.8-i368

I might have been able to find a fix. It is applicable to Time::Piece v1.09. I have already sent a patch to Time::Piece maintainer. For convenience, I'll provide it here too:

Matt,

I might have been able to identify the culprit. It looks like the following statement in Piece.xs(line 26):

#ifdef HAS_GNULIBC

meant to read:

#if !defined(HAS_GNULIBC)

as BSD systems are obviously not GNULIBC based. Once I applied this change, coredumps went away (on OpenBSD). Also, I added an zero-ization of 'mytm' at line 819:

memset(&mytm, 0, sizeof(mytm));

Now it seems working perfectly fine on OpenBSD and Linux (Debian Sarge stable). Please find a patch attached in case you'd like to include it in next release of the package. Thank you.

Here is the patch itself:


--- /home/olga/tmp/Time-Piece-1.09/Piece.xs     2005-11-15 17:46:23.000000000 -0500
+++ Piece.xs    2006-02-17 22:23:28.000000000 -0500
@@ -23,7 +23,7 @@
  * support is added and NETaa14816 is considered in full.
  * It does not address tzname aspects of NETaa14816.
  */
-#ifdef HAS_GNULIBC
+#if !defined(HAS_GNULIBC)
 # ifndef STRUCT_TM_HASZONE
 #    define STRUCT_TM_HASZONE
 # else
@@ -816,6 +816,7 @@
         char tmpbuf[128];
         struct tm mytm;
         int len;
+       memset(&mytm, 0, sizeof(mytm));
         my_init_tm(&mytm);    /* XXX workaround - see my_init_tm() above */
         mytm.tm_sec = sec;
         mytm.tm_min = min;

To apply obtain and extract the Time::Piece v1.09 (http://search.cpan.org/CPAN/authors/id/M/MS/MSERGEANT/Time-Piece-1.09.tar.gz). Then go to Time-Piece-1.09 directory, copy patch file there and apply it:


patch <patch

Continue on to build the package as usual. Enjoy!

Time::Piece::strftime() causes coredump on OpenBSD 3.8-i368

Just noticed, frequently (but not always) strftime method of Time::Piece package (v1.08) causes process that uses it to coredump. For instance as simple as


perl -MTime::Piece -e 'print localtime->strftime'

would produce a perl.core on my system.I'll use POSIX::strftime for now.I submitted a bug report, however in meanwhile I will look at fixing it myself. The timesone pointer in the below mytm structrure seems to be a culprit:


#3  0x06223323 in XS_Time__Piece__strftime (cv=0x8aab36fc) at Piece.xs:830
830             len = strftime(tmpbuf, sizeof tmpbuf, fmt, &mytm);
(gdb) p mytm
$1 = {tm_sec = 45, tm_min = 13, tm_hour = 17, tm_mday = 17, tm_mon = 1, tm_year = 106, 
  tm_wday = 5, tm_yday = 47, tm_isdst = 0, tm_gmtoff = -2089609684, 
  tm_zone = 0xde6cd6c <Address 0xde6cd6c out of bounds>}

Friday, February 10, 2006

Search Engines in Dynamic Web Era

Future web, dynamic or static?

It is perhaps hard to contest that modern web becomes increasingly dynamic in nature. There is barely a site out there, which does not have a part of its content if not the whole dynamically generated. All the mighty in the industry are throwing their weight behind their respective technologies. Herds of them exist today and counting. The (relatively) recent advancement of client-based dynamic content generation, Flash and AJAX in particular brings it all to a whole new level.

Points of entry. Many or one?

There are also so called (in a good sense) best practices, or patterns, etc of the good web design. For instance, nowadays predominant MVC (Model View Controller). Not that I am trying to say that you don't already know what it is, or unfamiliar with the term. More like in ever so populated terminology space there are duplicates and triplicates of the same abbreviations sometimes. Though I've never heard of MVC being overloaded before, but there's always someone in the crowd who definitely has. Many of the web oriented MVC design patterns pitch an idea of the 'front controller', a universal application dispatcher and coordinator, which orchestrates delivery of the incoming requests to its appropriate place in the handling chain. That is everything is siphoned through this component. In the URL terms, how may it look like? Perhaps somewhat like: http://your.site.com/FrontController?bla=ssdas&bla1=... In URL terms, requests mostly only differ in the query part relevant it might be in the context of search engines? We're about to find out.

Dynamic URLs. Bad?

Speaking of URLs inside our dynamic pages, perhaps they are unlikely static as well? Except for image URLs for the most part. URL rewrite technique is convenient as it gives the same power without involving forms. Yes, ASP.NET tribe, no forms and no __postback() javascripting. I'd call dynamic URLs 'inline forms'. Tracking of a current page and total number of pages, previous and next page numbers (if page long enough and paginated), and whatever else we may stuff these URLs with. Is it bad? Perhaps not. Or at least not always. Some webservers actually re-write URLs behind your back. For example take almost any servlet/jsp container. Disable cookies in your browser and give it a try on such a server with one of its demo apps. Occasionally look at pages source. Very probable you will see something looking like ';jsessionid' appended to the end of some URLs. Yea, to make things faster many if not all servlet/jsp containers pre-create user sessions for you in hope that you eventually will need them. Of course servlet container would attempt to use cookies first, but with many browsers now supporting cookie filters and users being ever security paranoid over plethora of viruses, spy and malware, there's a good chance that session cookies will be rejected every now and then. So what might it all mean for the search engine? Read on!

In-browser rendered dynamic content

Client side scripting, whether a form of JavaScript driven dynamic HTML or plugin based technology (like Flash), though very convenient for humans and pleasing in appearance, it is not so much perhaps altogether not at all convenient, neither it is pretty to the faceted eye of a search engine. How much not pretty? We'll see.

Ranks to the flanks

Everyone seem obsessed over their site, page, blog, etc search engines rankings. If you aren't listed, you do not exist. You have got to be in a top 10. Myth or fact: people (almost) never look past top 10 - a (first) page of search results? Not mentioning that some search engines result pages contain more that 10 entries on the first page ;). Search engine optimization services of all shapes, forms and colors flock the cyberspace. One can spit and likely it'll land on some SEO. They might be very much like hair loss remedies, sometimes.

If Search Engines were humans

So how are the search engines of a present day cope with all of that? Hopefully not in a way The Inquisition did. I claim no expertise in search engines technology, in fact I know perhaps little more or little less about it than your ordinary Joe the programmer. I'm only rumbling over what is published on the Net. Cursory examination of webmaster guidelines, published by some mainstream search engines brings a number of things an S.E. weary webmaster may want to watch for. Here's a short (perhaps incomplete - feel free to add) summary of them:

Every page's to be accessible through a static link

A some sort of word of a caution regarding dynamic URLs, i.e try to be able to render (navigate through) your pages without them

Content, produced by means of in-browser scripting, i.e. JavaScript-ed DHTML (there goes your AJAX), or plugin based (followed by Flash, cheers) is unlikely to be considered

Re-iteration of the word of caution regarding dynamic URLs, usually more specific this time. To a webmaster that may mean that some sorts of URL query parameters you might be willig to use, such as language code, page numbers, article numbers, view modes, etc may hinder crawler's crawling abilities

Multi-sourced pages. I.e. framesets are tough for crawlers. One's guess might be that IFRAMEs are perhaps just as tough.

Google goes further on a path of dynamic URLs - they seemingly ignore URLs, which contain '&id=':

...Don't use "&id=" as a parameter in your URLs, as we don't include these pages in our index.

Read more at: www.google.com/webmasters/guidelines.html

Let's give a closer look to what these guidelines may actually mean with respect to popular web development techniques.

Single application entry point (AKA Front Controller)

One of the ways in using front controller is to have query parameters to steer it. I.e. for example, in URL terms:

http://your.site.com/FrontController?page='MyPage'&lang=en

Renders 'MyPage' in English.

Sometimes, perhaps not without search engine factor involved, front controller may draw its input from remainig URL path, found beyond its context path, i.e. what follows it in the URL:

http://your.site.com/FrontController/page/MyPage?&lang=en

Still, even with this approach there might be some degree of query parameterizing desired, such as this. It will become more problematic to encode parameters as fragments of URL paths as number of parameters grow, contributing factor being that query parameters are in nature name/value pairs, something that URL path fragments are not. It may quickly grow out of hand and become plain ugly to say the least. However how much does this matter in connection to the webmaster guidelines listed above? Let's see:

.... Rrriight.Could be a bit tricky, but certainly doable. Nothing is not-doable. We can always go with second approach in our front controller, and default whatever additional parameters we might have to some meaningful values. Like setting page language to 'en' in above example. That will work. Unless we want search engine to index our French version of a page as well. And might be our Spanish one too... On to the next point!

.... This is pretty much along the same lines with the first point with regard to a front controller. Remaining points can be either seen through the same lens (4) or not applicable to front controller.

Dynamic URLs, URL re-write, form-based navigation(ASP.NET)

It might be a warm welcoming to the search engine hell for ASP.NET folks. Looks like they have to either avoid using some goodies of ASP.NET (which are, unfortunately based on form navigation) in favor of more traditional URL encoding, or probably forget about search engines all together. Why is that? Mainly because of:

I am not 100% sure on that but I have a feeling that crawlers won't submit your forms. Secondly...

Number 3 above.Yes, to differentiate between different ASP.NET events form must be submitted in different ways, which thereby done by javascripting form submission. Bummer.

It's not perhaps as gloomy for URL re-write and other dynamic form of URLs as it appears, because it seems that search engines would honor them for the most part, with noted exceptions.At least not perhaps as dark as to producers of in-browser generated content...

The Flash of AJAX

Perhaps AJAX and Flash content producing folk may seem to be in the deepest trouble of all regarding the guidelines. They are outright denied any indexing of their content at least until search engines begin executing javascript and Flash-ing as they crawl. By that time, perhaps, that ASP.NET hell will freeze over ;). However, AJAX people are probably more or less OK as long as there is at least some HTML left that crawlers can see without actually running any javascript.It is likely often the case as AJAX apps tend to be user session centric, like web based e-mail, all sorts of online planners,etc that require user personification prior to their usage. But if the application is entirely in Flash, and HTML is only used to bootstrap the Flash player, there is perhaps little if anything that might get you noticed by the crawler. RSS feeds maybe, who knows.

If humans were search engines

One theme also sounds through the webmaster guide lines. That is, develop your site around your users, not search engines. If it was ever so simple...

References

www.google.com/webmasters/guidelines.html

http://help.yahoo.com/help/us/ysearch/basics/basics-18.html

http://search.msn.com/docs/siteowner.aspx?t=SEARCH_WEBMASTER_REF_GuidelinesforOptimizingSite.htm

All trademarks, which may be mentioned in this article for the purpose of discussion are copyrights of their respective owners

Wednesday, February 08, 2006

Web site powered by a State Machine Engine

My wife (who is an artist) has just launched her new website. Not an IT news per say, however it is what is under the hood of that website's navigation controls what matters. Its navigation logic is powered by a FSM, a Finite State Machine implementation, which we developed together in Perl and Java. One day, once it is fully documented, it perhaps will make its way to CPAN. In meanwhile, please take a look: http://svava.ca

Monday, February 06, 2006

The 'disk space unabailable' and Apache::Session::Lock::Semaphore on OpenBSD 3.8

Looks like the kern.seminfo.semmns (maximum number of SysV semaphores in the system) setting in default OpenBSD 3.8 install (60) is too low for Apache::Session::Lock::Semaphore (with no NSems argument being used)to work. The bizarre space unavailable error is all it spits out. To overcome this, one just needs to increase kern.seminfo.semmns to, say 100:

sysctl kern.seminfo.semmns=100

Or, to make it permanent, add 'kern.seminfo.semmns=100' to /etc/sysctl.conf.

Building GD-2.0.33 library on OpenBSD 3.8

Unaltered, it gives a compilation error (well at least it did in my case), complaining of png.h being unavailable. That is because configure script does not properly set -I compilation option. It sets it to /usr/local/include/libpng/libpng, whereas just /usr/local/include/libpng would be enough. For brute force solution just edit Makefile, find and delete a redundant libpng string from -I option.I sent an e-mail to GD maintainers so that they might consider a proper fix at some point.