  • Chris Houston 535 posts 980 karma points MVP admin c-trib
    Jul 22, 2009 @ 00:42
    Chris Houston
    0

    Ideas for automatically creating Robots.txt rules.

    Hi Lee,

    My thoughts on this are based on a previous Umbraco 3 site that had issues where Umbraco kept getting its knickers in a twist and outputting pages with unfriendly URLs, i.e. www.mydomain.com/nodeid.aspx. These were obviously not URLs we wanted Google or any other search engine to index - but Google did index them, as I found when I checked Google Webmaster Tools.

    I added these bad URLs to the robots.txt, and the next time Google indexed our site the bad URLs were removed and the correct URLs appeared in the index.
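
    (For anyone wanting to do the same, the rules were just standard disallow entries - the node ids below are made up for the example:)

    User-agent: *
    Disallow: /1023.aspx
    Disallow: /1047.aspx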

    It made me think that when users remove pages from the Umbraco content section, those URLs just suddenly disappear, so it would be really good if there was a way of doing the following:

    a) Replacing the old page with a standard redirect document (where the user selects where the dead page should now redirect to). This should exist for X number of days and then automatically recycle.

    b) Adding a rule to the robots.txt file to disallow the search engines from continuing to index the page.

    I'd be interested to hear yours and others thoughts on this.

    Cheers,

    Chris

  • Petr Snobelt 923 posts 1535 karma points
    Jul 22, 2009 @ 09:13
  • rasb 162 posts 218 karma points
    Jul 22, 2009 @ 10:49
    rasb
    2

    That's a good idea, Petr!

    In any case it would make sense to use the canonical link.

    I have tried to create a macro using xslt.

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE xsl:stylesheet [ <!ENTITY nbsp "&#x00A0;"> ]>
    <xsl:stylesheet
        version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:msxml="urn:schemas-microsoft-com:xslt"
        xmlns:umbraco.library="urn:umbraco.library"
        xmlns:Exslt.ExsltCommon="urn:Exslt.ExsltCommon"
        xmlns:Exslt.ExsltDatesAndTimes="urn:Exslt.ExsltDatesAndTimes"
        xmlns:Exslt.ExsltMath="urn:Exslt.ExsltMath"
        xmlns:Exslt.ExsltRegularExpressions="urn:Exslt.ExsltRegularExpressions"
        xmlns:Exslt.ExsltStrings="urn:Exslt.ExsltStrings"
        xmlns:Exslt.ExsltSets="urn:Exslt.ExsltSets"
        exclude-result-prefixes="msxml umbraco.library Exslt.ExsltCommon Exslt.ExsltDatesAndTimes Exslt.ExsltMath Exslt.ExsltRegularExpressions Exslt.ExsltStrings Exslt.ExsltSets">

    <xsl:output method="xml" omit-xml-declaration="yes"/>
    <xsl:param name="currentPage"/>

    <xsl:template match="/">
        <!-- Build the canonical URL from the current host plus the page's nice URL -->
        <xsl:variable name="url" select="concat('http://', umbraco.library:RequestServerVariables('HTTP_HOST'))" />
        <link rel="canonical" href="{$url}{umbraco.library:NiceUrl($currentPage/@id)}" />
    </xsl:template>

    </xsl:stylesheet>
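
    To use it, save the above as an XSLT macro and drop it into the <head> of your template - in a v4 masterpage that would look something like this (the alias is whatever you call the macro):

    <head runat="server">
        <umbraco:Macro Alias="CanonicalLink" runat="server"></umbraco:Macro>
    </head>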

    /rasb

  • Lee Kelleher 4020 posts 15802 karma points MVP 13x admin c-trib
    Jul 22, 2009 @ 23:54
    Lee Kelleher
    0

    Hi Chris,

    You raise a lot of valid points - and I share your frustrations... so let me go through them point by point.

    Re: Google indexing the "mydomain.com/nodeid.aspx" URLs

    There was a discussion on my blog last week about Robots.txt and what Google indexes. It was suggested that Google only indexes pages that are linked to. So is it possible that somewhere in your Umbraco 3 site there were links to content pages using the "mydomain.com/nodeid.aspx" style?  I know that the old RSS package used that URL structure for the <guid> tag... could that be the cause?

    Petr is right about the canonical meta tag - it's what all the "kool kids" are using these days - and Google loves it (cleans up their index big time!) ... and rasb's XSLT will do the trick nicely!

    Re: Removing content pages from Umbraco

    Here's an idea! (using Umbraco v4 Events)

    When a page is deleted via the Umbraco back-end, the Delete events are triggered (BeforeDelete and AfterDelete) ... you could hook up some code to write the URL (of the page being deleted) to the robots.txt file. So when Google comes around to re-index your site, it will see the "disallow" rule and remove that URL from its index.
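
    A rough, untested sketch of what that could look like (the event and class names are the v4 API as I remember them - treat the robots.txt handling as purely illustrative):

    using System.IO;
    using System.Web;
    using umbraco.BusinessLogic;
    using umbraco.cms.businesslogic;
    using umbraco.cms.businesslogic.web;

    // Picked up automatically at application start because it inherits from ApplicationBase
    public class RobotsTxtOnDelete : ApplicationBase
    {
        public RobotsTxtOnDelete()
        {
            Document.BeforeDelete += new Document.DeleteEventHandler(Document_BeforeDelete);
        }

        private void Document_BeforeDelete(Document sender, DeleteEventArgs e)
        {
            // Grab the friendly URL of the page before it disappears
            string url = umbraco.library.NiceUrl(sender.Id);

            // Append a disallow rule - assumes robots.txt already ends with a
            // "User-agent: *" group, and doesn't check for duplicates
            string robotsPath = HttpContext.Current.Server.MapPath("~/robots.txt");
            File.AppendAllText(robotsPath, "\r\nDisallow: " + url);
        }
    }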

    I do have reservations about doing this, as the robots.txt is meant to be about exclusion - not about a list of pages that don't exist (404).

    Which leads me on to using a 404 handler to deal with it.

    There is an Umbraco book about "not found handlers": http://umbraco.org/documentation/books/not-found-handlers

    If the standard 404handler isn't enough, then you could look at putting together a custom 404 handler to check against a list of old (deleted) URLs and serve up something accordingly?  (The list of old/deleted URLs could be populated via the BeforeDelete event, as mentioned above)
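
    For example, a bare-bones custom handler might look something like this (using the legacy INotFoundHandler interface - the lookup of deleted URLs is left as a stub, and the class name is just a placeholder):

    using umbraco.interfaces;

    // Register this in /config/404handlers.config so it runs when no content matches the URL
    public class DeletedPageHandler : INotFoundHandler
    {
        private int _redirectId = -1;

        // Don't cache the result while testing
        public bool CacheUrl
        {
            get { return false; }
        }

        // The node id Umbraco should render instead of a 404
        public int redirectID
        {
            get { return _redirectId; }
        }

        public bool Execute(string url)
        {
            // Look the URL up in your own list of old/deleted pages
            // (e.g. populated via the BeforeDelete event mentioned above)
            int replacementNodeId = LookupReplacementNode(url);

            if (replacementNodeId > 0)
            {
                _redirectId = replacementNodeId;
                return true;  // we handled the request
            }

            return false;     // fall through to the next handler in the list
        }

        private int LookupReplacementNode(string url)
        {
            // Stub - query wherever you stored the old/deleted URLs
            return -1;
        }
    }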

     

    Hope this helps in some way, let me know your thoughts - it's a good discussion!

    Cheers,

    - Lee

  • Lee Kelleher 4020 posts 15802 karma points MVP 13x admin c-trib
    Jul 26, 2009 @ 01:02
    Lee Kelleher
    0

    I was thinking about what happens when you rename a page in Umbraco - any existing links to that page break (hence why the /nodeid.aspx URL isn't such a bad GUID/permalink).

    Following on from my last post, one solution could be when a content document is updated, we hook into the Save event, check if the "page name" is different - if so, then we can insert/append the old page name to the "umbracoUrlAlias" field (if your doc-type has it).

    This way your old URLs aren't broken.

    I haven't written any code for this (yet) ... but when I do, I'll post it here.  Unless someone else likes the idea and writes the code?
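
    For anyone who fancies picking it up, here's a very rough, untested sketch of the idea (class and handler names are just placeholders, and it assumes the v4 Document events and legacy API, plus a doc-type that already has the "umbracoUrlAlias" property):

    using umbraco.BusinessLogic;
    using umbraco.cms.businesslogic;
    using umbraco.cms.businesslogic.web;

    public class OldUrlToAlias : ApplicationBase
    {
        public OldUrlToAlias()
        {
            Document.BeforeSave += new Document.SaveEventHandler(Document_BeforeSave);
        }

        private void Document_BeforeSave(Document sender, SaveEventArgs e)
        {
            // At this point NiceUrl should still return the published (old) URL,
            // because the runtime cache hasn't been refreshed yet
            string oldUrl = umbraco.library.NiceUrl(sender.Id);
            if (string.IsNullOrEmpty(oldUrl) || oldUrl == "#")
                return;

            // Turn "/products/old-name.aspx" into "products/old-name"
            string oldAlias = oldUrl.TrimStart('/').Replace(".aspx", string.Empty);

            var aliasProperty = sender.getProperty("umbracoUrlAlias");
            if (aliasProperty == null)
                return; // doc-type doesn't have the property

            // NOTE: a proper version should compare the old and new page names and only
            // append when they actually differ - skipped here to keep the sketch short
            string current = (aliasProperty.Value as string) ?? string.Empty;
            if (!current.Contains(oldAlias))
            {
                // umbracoUrlAlias takes a comma-separated list of aliases
                aliasProperty.Value = string.IsNullOrEmpty(current)
                    ? oldAlias
                    : current + "," + oldAlias;
            }
        }
    }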

    Cheers,

    - Lee

  • Chris Houston 535 posts 980 karma points MVP admin c-trib
    Jul 27, 2009 @ 10:29
    Chris Houston
    0

    Hi Lee,

    One thing that needs to be taken into account is that the user may well rename a page and then create a new page with the original name, so in this scenario the redirect would need to be cancelled. I think it should always be given to the user as an option when they rename / delete a page.

    I might have a play with this later this week if you've not already got it sorted :)

    Cheers,

    Chris

  • Lee Kelleher 4020 posts 15802 karma points MVP 13x admin c-trib
    Jul 27, 2009 @ 12:53
    Lee Kelleher
    0

    Hi Chris,

    As far as I am aware, the "umbracoUrlAlias" is used by a Not Found Handler (by default it's the first in the list) - so the aliases will only work if the original page/URL is not found.  So if the user/editor creates a new page - with the original page title/URL ... then it will be picked up first (and not by the "umbracoUrlAlias").
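
    (For reference, the default /config/404handlers.config looks roughly like this - "SearchForAlias" is the handler that reads "umbracoUrlAlias", and it sits at the top of the list:)

    <NotFoundHandlers>
      <notFound assembly="umbraco" type="SearchForAlias" />
      <notFound assembly="umbraco" type="SearchForTemplate" />
      <notFound assembly="umbraco" type="SearchForProfile" />
      <notFound assembly="umbraco" type="handle404" />
    </NotFoundHandlers>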

    Of course, to keep things clean, some code could be written to remove the "umbracoUrlAlias" from an old page ... but there would be some overhead with that (i.e. look-ups in the DB/XML cache, writing the property back to the database, etc).

    Code-wise, I doubt I'll have time to write anything like that in the next few weeks ... a couple of client projects' deadlines are looming, etc.

    But do let me know if you write anything, I'd be happy to help test, etc.

    Cheers,

    - Lee

  • Lee Kelleher 4020 posts 15802 karma points MVP 13x admin c-trib
    Feb 15, 2010 @ 02:38
    Lee Kelleher
    0

    Hi guys,

    Following up on an old topic (from when I released the Robots.txt Editor) ... finally got around to coming up with a solution for renaming/deleting old content pages.  Behold the 301 Moved Permanently (NotFoundHandler)!  (I wanted to call it Permanent Redirect, but Peter Gregory got there first!)

    As an alternative to the canonical link, this package lets you add a new property to your document-type to include old/bad URLs. It works in the same way as the "umbracoUrlAlias" property alias - but instead redirects the user to the new content page/node/URL, along with a 301 HTTP status code.
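
    So if, say, "/old-page.aspx" is listed against the new node (URLs made up for the example), a request for it comes back roughly like this:

    GET /old-page.aspx HTTP/1.1
    Host: www.mydomain.com

    HTTP/1.1 301 Moved Permanently
    Location: http://www.mydomain.com/new-page.aspx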

    Let me know if you use it... look forward to any feedback.

    Cheers, Lee.

  • Qube 74 posts 116 karma points
    May 17, 2010 @ 01:01
    Qube
    0

    I've been trying to tackle this issue too. My approach was to write a wrapper for UrlRewriting.config. Every umbraco install has UrlRewriting built in, so I figured it was the most open and reliable way to handle it.

    In a nutshell, you can add a "Url Manager" property to your document type, and it will list all the rules in UrlRewriting.config that apply to a piece of content. You can add new rules and remove old ones. Saving stores the changes in the database, not XML. Publishing commits the changes to XML, and if the primary page name has changed, a new rule pointing the old URL to the new one is created.

    The rules themselves are structured in such a way that they perform a 301 redirect.
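
    For example, a generated rule in the <rewrites> section of UrlRewriting.config ends up looking something like this (the URLs here are made up):

    <add name="OldAboutPage"
         virtualUrl="^~/about-the-company.aspx"
         destinationUrl="~/about-us.aspx"
         redirect="Application"
         redirectMode="Permanent"
         ignoreCase="true" />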

    The extension needs some work before it's turned into a project (check for duplicates, better UI etc.), but it's already at work in our corporate website, and it works great so far.

  • Lee Kelleher 4020 posts 15802 karma points MVP 13x admin c-trib
    May 17, 2010 @ 08:28
    Lee Kelleher
    0

    Hi Ben, good idea... just checking, have you taken a look at the 301 URL Tracker package yet?

  • Qube 74 posts 116 karma points
    Jun 17, 2010 @ 06:26
    Qube
    0

    No I haven't. Looks pretty much perfect :)

    I've since abandoned my UrlRewriting wrapper, because it will never be able to support multiple domain setups in umbraco (limitation of UrlRewriting.net). Look forward to investigating UrlTracker!
