MacLochlainns Weblog

Michael McLaughlin's Technical Blog

Site Admin

MySQL Image Architecture

with 14 comments

The LinkedIn MySQL DB Development group posed a questions on how to handle images. Naturally, the argument always goes: Should images be deployed in the database or the file system? I believe they should be stored in the database because the cost and time associated is too high with regard to managing files, a file naming schema, and backing up the file system discretely from the database.

Since there’s a significant difference between the backup of transactional data and image data, they should be placed in different databases. The imagedb database is where you would place the images and large text descriptions, as shown in the MySQL Workbench ERD:

imagedb ERD

The imagedb ERD splits the foreign key references back to the system_user table, which contains the individual user credentials. The system_user table serves as the Access Control List (ACL) for the application.

Until I get a chance to write the code for this model, you can refer to the generic PHP/MySQL solution from several years back (its code source was last tested with PHP 5.3). As always, I hope this helps.

Written by maclochlainn

February 23rd, 2014 at 2:02 am

14 Responses to 'MySQL Image Architecture'

Subscribe to comments with RSS or TrackBack to 'MySQL Image Architecture'.

  1. Don’t use MEDIUM BLOB, just use the normal 4 byte pointer blob. 3 byte pointer gets you absolutely nothing (except trouble). No CPU has ever dealt with 24bit integers and saving 8 bits there isn’t going to mean anything with how InnoDB does blobs.

    Stewart Smith

    23 Feb 14 at 6:08 am

  2. Cross schema relations? Really? Don’t even get me started about “file name conventions”.

    I agree the file system (as in manually managed hierarchies) is not the best place. But a RDBMS surely is not the place to store blobs (most of the time).

    Especially considering that a schema actually is the property of the customer and its business case. Not something you create that is generally valid for all cases and just throw it at the customer…

    serverhorror

    23 Feb 14 at 6:52 am

  3. Stewart, Great point (MySQL Storage Requirements). I’ve updated the drawing. Thanks, Michael

    maclochlainn

    23 Feb 14 at 7:05 am

  4. Martin (?), I did acknowledge that there are two schools of thought. Obviously, you’re in one and I’m in the other. As to the owner of the schema, doesn’t ownership depend on whether you’ve a small application with a narrow use case (like a hosted customer) or a large application with several use cases?

    maclochlainn

    23 Feb 14 at 7:09 am

  5. To my opinion, File is not a relational data. Relational database is a wrong place to store non-relational data including files. filesystem is the right place for that.

    There are many reasons a DBA should oppose it.

    1. Every write operation to RDBMS will result in write into 3-4 places. Transaction logs (Redo info) and Rollback segment (Undo info) and final tablespace files etc and associated flushes also.
    Do you think 3-4 times overhead is worth for storing a file?

    2. IO is the major bottleneck in any RDBMS and any small hit will have tremendous impact on overall performance of databases.

    3. If images are stored in filesystem of the application server, there is performance hit on database. Any slow down of database will affect the complete application.

    Jobin Augustine

    23 Feb 14 at 10:43 am

  6. Jobin,

    I’d agree that some files are better suited to the application server, like those that display as icon images as part of your company’s generic graphics on webpage layouts. However, there are other images like those that are tied to catalog materials. Images tied to catalog materials are low write frequency, which means they have a nominal impact on transaction logs. Putting them in a separate database also lets you manage their transaction logs separately. Also, a large text or binary file can be considered as a non-key data attribute.

    Queries against these types of catalog images are directly linked to records. When you query the records you have two options, call another local function to read a mime-type and binary stream from the database, or return a URL value.

    A local function would return a mime-type header and converted binary stream to the browser, and the browser would render the image. If you return a URL, the browser determines the mime-type by reading the file type of the image to sets a header before reading the image from the file system. The physical location of the files are disclosed in primary read of the HTML source to the browser because images are read through separate asynchronous threads. This discloses either relative or absolute path information on the location of files to potential hackers, which means the location of images unrelated to the transaction. The trade off is the physical IO against the OS versus the physical IO against the database server, and I doubt there’s much difference between the two.

    maclochlainn

    23 Feb 14 at 3:06 pm

  7. The blobs are going to end up being stored off page, and you are going to do extra IO. The overflow pages are allocated in rather big chunks and lots of space is wasted by blobs.

    There are many technical reasons why you don’t want to store blobs in the database.

    Use a filesystem with transactional writes, like ZFS and updates points to blobs on the filesystem. This gives you scaleable storage for binary data (which is what a distributed filesystem is for) while allowing you to manage metadata in the RDBMS.

    This is essentially what Oracle would called a BFILE (a pointer to a binary file).

    Justin Swanhart

    23 Feb 14 at 3:46 pm

  8. Justin, I like your solution and agree with it. Notwithstanding, sometimes implementing the ZFS file system isn’t an available option as a solution in organizations. :-( When choosing the file system is a possibility, the downside is you still need a convention for file names.

    In that case, you would drop the imagedb database and mime_type table but keep the image and description tables with a couple changes. You’d drop the mime_type and item_image columns from the image table.

    maclochlainn

    23 Feb 14 at 5:10 pm

  9. My opinion is that it is not a good idea to store the images in the database. You may be better served by computing a SHA hash of your image, use that as the filename and then store the SHA hash in the database.
    If ease of back-ups is your worry, a trivial solution would be to store those files in a subdir under the database data dir and modify Xtrabackup wrapper script to tar up that dir after xtrabackup has done its thing.

    Antony Curtis

    23 Feb 14 at 8:50 pm

  10. Antony, that’s a great fix to disclosing the file names in the URL. Thanks, Michael

    maclochlainn

    23 Feb 14 at 10:49 pm

  11. Regarding the ownership: Whoever pays for the development of the schema is the owner.

    Regarding BLOBs in a DB. Think mail servers. There’s a reason why almost all storage of mails is outside of a RDBMS (generally not specific to MySQL). Write hits being only one if them.

    Creating a (cryptographic) checksum of the file and using that as the path is well established (mail spools, storage deduplication for free — also helps the OS cache,…)

    The thing that freaks me out is the extremely tight coupling between mass storage (images, files, BLOBs,…) and the meta data. How would you separate those? I do have a lot of binaries in the inventory(100s of 1000s) and I like to be able to nicely separate inventory and data at least to be able to scale the inventory independent of the files. How would you do that?

    The one reason to use BLOBs inside of a DB is not a technical one but rather to be able to create an atomic backup if all the inventory you have (Think: compliance).

    Then again with a fully qualified URI (http, file, s3,…) you can simply move data around implement another provider and let your customers use whatever they already have and there’s no need for them to migrate and no need for you to pay for the bandwith or additional capacity required to serve the images.

    But I’m getting carried away…I guess you are right. Let’s agree to disagree[×] :)

    Oh and I do like my nick. That’s why I put it in the name field, after all we are simply brabbling over technical details without either of us knowing enough about each others past experiences or current requirements.

    [×]: I keep the right to change my mind or past decisions at any time without further notice :)

    serverhorror

    24 Feb 14 at 8:35 am

  12. Antony’s solution works at very large scale. It is what we did at Easynews, because it is a de-duplication mechanism for content.

    In our case, the metadata was the post information from usenet, such as ‘subject’, ‘from’, ‘filetype’, etc.

    All files were hashed to a filesystem of directory format:
    a/ b/ c/ d/ … f/
    under each of those
    a/ b/ c/ d/ … f/
    under each of those
    a/ b/ c/ d/ … f/

    This allowed us to spread the “prefixes” of the hashes over many different devices, as distributed filesystems were not really invented yet when we did this. We used NetApp and EMC storage.

    Justin Swanhart

    24 Feb 14 at 11:31 am

  13. Neither database nor filesystem is always the right choice. Pick whichever is best for your use case. We have people storing video in chunks in the server for video on demand services, others storing images and whatever else.

    If the images are integral to your application you get the usual database benefits, like transactional guarantees and the reliability of replication along with easy use of the standard database scaling out solutions.

    If you’re doing a packaged, hosted application, do you really want to try to get a hosting service to give you great filesystem access when you can instead ask for a straightforward database server that they’ve done ten thousand times already? In such cases using the database by default can make life a lot easier for your customers, even if you provide other options.

    Try not to store the images in your main database server. This is bulk storage and can be very well suited to large SATA drives rather than SSD or 15K SCSI. It can be on the same hardware, but not great to be in the same mysqld. It is possible to combine the bulk storage with OLTP on the same mysqld but you’ll probably find that the best tunings are different and that you’re better off splitting them.

    It’s perhaps also worth knowing that back in 2004 we removed the article text from Wikipedia’s main tables and eventually even off the main database servers, instead storing them in multi-revision combined compressed blobs on SATA drive sets, still stored in MySQL on those.

    Character things also may not be best in the database server and your use case matters here as well. Bulk textual information, whether it’s varchar or text, is not something you want to have in very frequently updated or accessed tables, it does nasty things to your cache hit rates and to the sizes of the rows that you have to read. storing a name and address in a customer record is a common poor choice. Most of the customer references will be joins to do things like getting order summary information, not items that need their address.

    Remember that we have filesystems with a database underling layer that work well. That’s no surprise because a filesystem also has use for transactional integrity. Have a look at Microsoft’s relatively new ReFS, for example, that uses btrees on disk. (See http://blogs.msdn.com/b/b8/archive/2012/01/16/building-the-next-generation-file-system-for-windows-refs.aspx ).

    The specific use cases matter. Don’t get religious, consider each case and pick whichever way makes most sense for the task at hand. First thing to do is probably to ask whether the database features matter to your use case. If they do, that’d be a good argument for picking the database over the filesystem. But maybe not a sufficient one, consider the rest of the situation as well.

    James Day, MySQL Senior Principal Support Engineer, Oracle

    James Day

    24 Feb 14 at 2:02 pm

  14. James, Thanks for your comment! It was balanced, targeted, and considerate. I concur 100%, the use case should drive the solution and IO throughput isn’t the governing factor in all cases. ;-) Michael

    maclochlainn

    25 Feb 14 at 11:25 pm

Leave a Reply