PS2-47: SAS Hash Objects are Memory Bound. So What?

Clinical Medicine & Research, August 2012, 10 (3) 194; DOI: https://doi.org/10.3121/cmr.2012.1100.ps2-47

Abstract

Background/Aims If you are like most SAS programmers, you have heard amazing claims about SAS hash objects: lightning-fast lookups without the need to sort. You also know that hash objects are memory-bound, making them useless for your big data needs. This presentation outlines the benefits of hash objects when used with DATA step BY-group processing. GHRI uses hash objects for the VDW Utilization data build process. As you will see, hash objects are multi-purpose data containers that can dramatically reduce processing time, even when dealing with hundreds of millions of records. This talk will cover advanced features of the SAS hash object, including key summaries, check(), ref(), multikey objects, and sorting.

Methods Using hash objects with BY-group processing allows us to read our source data only once. This is a critical aspect of the current approach and the reason we achieve fantastic build times. The only requirement is that our source data be grouped. As each new BY group is encountered, the hash objects are cleared. From ‘first dot’ to ‘last dot’ (the first and last records of the BY group), we load up our hash objects, putting procedures in one, diagnoses in another, and encounters in a third. On the last record of the BY group, we output our datasets, then begin the process again on the next group.
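The pattern described above can be sketched outside SAS. The following is a minimal Python analog, not the GHRI build code: plain dictionaries stand in for the hash objects, `itertools.groupby` stands in for BY-group processing, and the field names (`consumer_id`, `admit_date`, `dx`, `px`) and the function `build_by_group` are hypothetical. It assumes, as the method requires, that the input arrives already grouped by consumer.

```python
from itertools import groupby
from operator import itemgetter


def build_by_group(records):
    """Single pass over records that are already grouped by 'consumer_id'.

    Dictionaries play the role of the hash objects: they are created
    empty for each group, filled while the group is read, and flushed
    to the output lists once the group ends.
    """
    outputs = {"utilization": [], "diagnosis": [], "procedure": []}

    for consumer_id, group in groupby(records, key=itemgetter("consumer_id")):
        # New group: start with empty containers (analog of clearing
        # the hash objects when a new BY group is encountered).
        encounters, diagnoses, procedures = {}, {}, {}

        for rec in group:
            enc_key = (consumer_id, rec["admit_date"])
            # Re-inserting an existing key overwrites it, so duplicate
            # records collapse to one entry.
            encounters[enc_key] = True
            if rec.get("dx"):
                diagnoses[enc_key + (rec["dx"],)] = True
            if rec.get("px"):
                procedures[enc_key + (rec["px"],)] = True

        # End of group: write each container out in key order, so rows
        # come out sorted by admit date within the consumer.
        outputs["utilization"].extend(sorted(encounters))
        outputs["diagnosis"].extend(sorted(diagnoses))
        outputs["procedure"].extend(sorted(procedures))

    return outputs


# Hypothetical sample input, grouped (but not sorted) by consumer.
sample = [
    {"consumer_id": 1, "admit_date": "2004-02-01", "dx": "250.00", "px": "99213"},
    {"consumer_id": 1, "admit_date": "2004-01-15", "dx": "401.9", "px": None},
    {"consumer_id": 1, "admit_date": "2004-02-01", "dx": "250.00", "px": "99213"},
    {"consumer_id": 2, "admit_date": "2005-03-10", "dx": None, "px": "36415"},
]
result = build_by_group(sample)
```

Because the containers only ever hold one group at a time, peak memory in this sketch depends on the size of the largest group rather than on the size of the whole input, which is presumably why a memory-bound structure remains practical for a very large file.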

Results GHRI is now sourcing VDW Utilization data from our cost management data for 2004 to present. The input file is 130 gigabytes and has over 170 million records. We walk through this file only once and output 5 datasets (utilization, diagnosis, procedure, invalid dx codes, and invalid px codes). The data are de-duplicated and sorted by consumer number and admit date. The process uses 26 hash objects and 16 hash iterator objects. The entire process is one DATA step. It runs in under 2 hours on a Windows 7 PC with 16 GB of RAM.

Conclusions SAS hash objects are not just for lookups. They are flexible all-purpose data containers that allow programmers to exploit the inherent power of the data step.
