
  • Heather Floyd 604 posts 1002 karma points MVP 5x c-trib
    Jun 21, 2023 @ 16:59

    How to access a media file on Azure Blob Storage directly? (Not via http URL)

    Hello friends,

    I have some code that runs on the media "Save" event. It extracts the text content from a PDF and stores it in a text field on the Media item for use in searching, etc. It uses TikaOnDotNet.TextExtraction.TextExtractor (see: https://github.com/KevM/tikaondotnet#usage).

    When running locally, this code works great - it can access the PDF file via the /Media/ folder and read it.

    The site is hosted on Umbraco Cloud, which uses Azure Blob Storage for media. If the file cannot be located in the Media folder (which is generally empty on Cloud sites), the code falls back to the media item's URL and downloads the file via new WebClient().DownloadData(uri). If the media file is added or saved on the Live environment, this works just fine, since the URI is publicly accessible. If it is added on a Development or Staging environment, however, the download fails because those environments are protected via Basic Auth.

    Can anyone recommend a way to read a media file from Azure Blob Storage on those protected environments?

  • Heather Floyd 604 posts 1002 karma points MVP 5x c-trib
    Jun 21, 2023 @ 18:10

    Thanks to help from Nik Rimington and Anders Bjerner, I was able to find a solution utilizing Umbraco's IMediaFileSystem.

    Stripped-down example using Dependency Injection:

    using Umbraco.Core.IO;
    private readonly IMediaFileSystem _mediaFileSystem;
    ...
    
    // Open a stream for reading the file contents
    using (var fs = _mediaFileSystem.OpenFile(mediaUmbracoFile))
    {
        if (fs != null)
        {
            var fileTextContent = mediaParser.ParseMediaText(fs,
                out extractedMetaFromTika);
            ...
        }
        else
        {
            _iLogger.Error(typeof(RegisterEventsComponent),
                new Exception($"Unable to open PDF file {fileInfo.FullName}"),
                "Unable to Open PDF file");
        }
    }
       ...
    
    // Extracts the text and metadata from a media file supplied as a stream
    public string ParseMediaText(Stream SourceStream, out Dictionary<string, string> MetaData)
    {
        var sb = new StringBuilder();
        var metaData = new Dictionary<string, string>();
        var textExtractor = new TextExtractor();
        try
        {
            // Buffer the whole stream in memory, since the bytes are passed to Extract() as an array
            using (var memoryStream = new MemoryStream())
            {
                SourceStream.CopyTo(memoryStream);
                var streamBytes = memoryStream.ToArray();

                var textExtractionResult = textExtractor.Extract(streamBytes);
                sb.Append(textExtractionResult.Text);
                metaData = (Dictionary<string, string>)textExtractionResult.Metadata;
            }
        }
        catch (Exception ex)
        {
            // Wrap the original exception so the caller can see where the failure happened
            var msg = "MediaParserService.ParseMediaText: Could not read media item provided by stream";
            throw new Exception(msg, ex);
        }
        MetaData = metaData;
        return sb.ToString();
    }
    

    Additionally, for a v9+ implementation, the tip is to look at https://github.com/umbraco/UmbracoExamine.PDF/blob/v11/dev/src/UmbracoExamine.PDF/PdfPigTextExtractor.cs
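
    As a rough illustration of the same idea on v9+, here is a minimal sketch that assumes the MediaFileManager service from Umbraco.Cms.Core.IO (which took over from IMediaFileSystem in those versions) is available through Dependency Injection. The PdfStreamReader class and the mediaUmbracoFile parameter are hypothetical names, the path is assumed to be the plain string value of the media item's umbracoFile property, and the sketch has not been verified against an Umbraco Cloud environment:

    ```csharp
    using System.IO;
    using Umbraco.Cms.Core.IO;

    // Hypothetical helper class; only MediaFileManager and its FileSystem come from Umbraco.
    public class PdfStreamReader
    {
        private readonly MediaFileManager _mediaFileManager;

        public PdfStreamReader(MediaFileManager mediaFileManager)
        {
            _mediaFileManager = mediaFileManager;
        }

        // Reads the media file into memory through the configured media file system
        // (no HTTP request involved). Returns null if the file cannot be found.
        public byte[]? ReadAllBytes(string mediaUmbracoFile)
        {
            var fileSystem = _mediaFileManager.FileSystem;
            if (!fileSystem.FileExists(mediaUmbracoFile))
            {
                return null;
            }

            using (var stream = fileSystem.OpenFile(mediaUmbracoFile))
            using (var memoryStream = new MemoryStream())
            {
                stream.CopyTo(memoryStream);
                return memoryStream.ToArray();
            }
        }
    }
    ```

    The returned bytes could then be handed to whichever text extractor is in use, in the same way ParseMediaText passes them to TextExtractor.Extract.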

  • Gurumurthy 52 posts 125 karma points
    Aug 15, 2023 @ 23:26

    Hi,

    private readonly IMediaFileSystem _mediaFileSystem is part of Umbraco v8, right? How can we get the same in Umbraco 11?

    Basically, I need to get the full path of a media file that is stored in Azure Blob Storage.

    Thanks,
