Big data can be both a blessing and a curse, especially for public safety. It has the potential to reveal previously unknown connections among vast numbers of seemingly disparate data points. With those insights, public safety organizations can become far more efficient, save more lives and make better use of limited resources. But the sheer amount of data can quickly become burdensome without a clear understanding of what to do with it and how to analyze it.
Data scientists at public safety agencies handle large amounts of data every day, from computer-aided dispatch (CAD) systems, records management systems (RMS) and any number of public sources. One way they cut down on the time and expense involved is by using a newer file format called Apache Parquet, or simply Parquet.
Parquet is an open-source format that is faster to query and takes up far less space than comma-separated values (CSV). Faster queries and smaller files translate into lower compute and storage costs. While originally designed for complex data processing like business intelligence, machine learning and other artificial intelligence (AI) uses, Parquet can also be used for broader applications.
Here’s how Parquet works
Unlike CSV files, which store data in rows, Parquet stores data in columns. Cloudera and Twitter designed it to run on Hadoop and released it to the open-source community in 2013. Its column-oriented structure is built from the ground up to run queries faster, and its architecture reflects a write-once, read-many-times paradigm. How much faster is it? Anywhere between 4X and 40X faster.
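To make the row-versus-column distinction concrete, here is a minimal sketch in plain Python. It is only an in-memory illustration, not Parquet's actual on-disk layout, and the record fields (incident_id, priority, unit, response_seconds) are hypothetical CAD-style names invented for the example.

```python
# Hypothetical CAD-style records, stored the way a CSV stores them: row by row.
rows = [
    {"incident_id": 1, "priority": 2, "unit": "E12", "response_seconds": 310},
    {"incident_id": 2, "priority": 1, "unit": "M4", "response_seconds": 245},
    {"incident_id": 3, "priority": 2, "unit": "E12", "response_seconds": 402},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per field."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: every whole record is visited just to pull out one field.
avg_from_rows = sum(r["response_seconds"] for r in rows) / len(rows)

# Column layout: only the single column the query needs is touched.
response_times = columns["response_seconds"]
avg_from_columns = sum(response_times) / len(response_times)

print(avg_from_rows, avg_from_columns)
```

Both averages are identical; the difference is how much data has to be read to get there. With a handful of fields the gap is small, but with dozens of fields and millions of records, skipping the unneeded columns is where much of the query speedup comes from.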
Putting it to the real-world test
A simple “read” test conducted by CentralSquare Labs on a 20-million-record CAD data file returned a result in 15 seconds from Parquet versus 66 seconds from CSV. Writing the file was even more impressive: 9 seconds for Parquet versus 363 seconds (just over 6 minutes) for CSV. The write test saved a Pandas DataFrame to each target format.
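For readers who want to try a similar comparison on their own data, the sketch below shows one way it might be set up. It is not the CentralSquare Labs test harness. The helper names and file names are hypothetical, and compare_formats assumes pandas plus a Parquet engine such as pyarrow are installed. The last two lines simply turn the timings reported above into speedup factors.

```python
import time
from pathlib import Path


def timed(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def speedup(slow_seconds, fast_seconds):
    """How many times faster the fast run was than the slow one."""
    return slow_seconds / fast_seconds


def compare_formats(df, workdir):
    """Time writing and reading a DataFrame as CSV and as Parquet.

    Assumes pandas and a Parquet engine (for example pyarrow) are installed.
    pandas is imported here so the helpers above work without it.
    """
    import pandas as pd

    workdir = Path(workdir)
    csv_path = workdir / "cad_sample.csv"          # hypothetical file names
    parquet_path = workdir / "cad_sample.parquet"

    _, csv_write = timed(df.to_csv, csv_path, index=False)
    _, parquet_write = timed(df.to_parquet, parquet_path, index=False)
    _, csv_read = timed(pd.read_csv, csv_path)
    _, parquet_read = timed(pd.read_parquet, parquet_path)

    return {
        "write_speedup": speedup(csv_write, parquet_write),
        "read_speedup": speedup(csv_read, parquet_read),
    }


# Timings reported in the article, in seconds.
print(round(speedup(66, 15), 1))    # read:  4.4
print(round(speedup(363, 9), 1))    # write: 40.3
```

Results will vary with hardware, data types and compression settings, so a single run like this is a rough indication rather than a benchmark.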
Due to Parquet’s columnar arrangement of data, compression is much more efficient. The previously referenced data set of 20 million records used 4.3 gigabytes of storage in CSV and only 12 megabytes in Parquet. Yes, that was “gigs” being compressed into “megs.”
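One reason columns compress so well is that the values within a single column tend to repeat. The toy run-length encoder below illustrates the idea. Parquet's real encodings (dictionary and run-length/bit-packing schemes, usually combined with a general-purpose compression codec) are more elaborate, and the priority values here are made up for the example. The last lines work out the size ratio from the figures quoted above, assuming decimal units (1 GB = 1,000 MB).

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# A hypothetical "priority" column: few distinct values, long runs.
priority_column = [1, 1, 1, 2, 2, 2, 2, 3]
encoded = run_length_encode(priority_column)
print(encoded)  # [(1, 3), (2, 4), (3, 1)]

# Size ratio implied by the article's figures.
csv_mb = 4300      # 4.3 GB in decimal megabytes
parquet_mb = 12
print(round(csv_mb / parquet_mb))  # 358
```

In a row-oriented file the same priority values would be interleaved with IDs, timestamps and free text, so long runs like these never form.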
Now for the real benefit — cost
Many agencies use Amazon Web Services (AWS) for several components of their work. AWS, like other cloud-based services, typically charges by the amount of data scanned per query and by the amount of data stored. Data-processing frameworks like Amazon EMR (Elastic MapReduce), data warehouses like Amazon Redshift and storage services like Amazon S3 all charge for processing and storage. Shifting to Parquet can return a large chunk of that expense – sometimes thousands of dollars per year – to an agency’s budget through lower storage costs and less data scanned per query, as well as less time waiting for results from routine queries run on a daily or weekly basis.
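A back-of-the-envelope sketch of that pricing model is shown below. The rates and query count are placeholders chosen purely for illustration, not quoted AWS prices, and the model assumes every query scans the whole file. Only the two file sizes come from the test described earlier.

```python
CSV_GB = 4.3        # file size reported in the article
PARQUET_GB = 0.012  # 12 MB expressed in decimal GB


def monthly_cost(size_gb, scans_per_month, storage_rate_per_gb, scan_rate_per_gb):
    """Storage charge plus scan charge, assuming each query reads the whole file."""
    storage = size_gb * storage_rate_per_gb
    scanning = size_gb * scans_per_month * scan_rate_per_gb
    return storage + scanning


# Placeholder values for illustration only, not real provider pricing.
STORAGE_RATE = 0.02     # currency units per GB stored per month
SCAN_RATE = 0.005       # currency units per GB scanned
SCANS_PER_MONTH = 30    # roughly one full query per day

csv_cost = monthly_cost(CSV_GB, SCANS_PER_MONTH, STORAGE_RATE, SCAN_RATE)
parquet_cost = monthly_cost(PARQUET_GB, SCANS_PER_MONTH, STORAGE_RATE, SCAN_RATE)

savings_fraction = 1 - parquet_cost / csv_cost
print(f"{savings_fraction:.1%}")  # 99.7%
```

Because both charges scale with file size in this simplified model, the percentage saved depends only on the size ratio, whatever the actual rates are. The dollar amount depends on how much data an agency stores and how often it is queried.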
Weighing the pros and cons
Like all technologies, Parquet does have a few drawbacks. Because it is a machine-readable binary format, a raw file is very difficult for people to read. If queries routinely need entire rows, meaning every field of each record, a row-based storage format such as Apache Avro would be a better fit. Parquet does not allow old fields to be deleted, even though new ones can be added.
However, Parquet data can be queried with languages such as SQL, and the format is supported by all major cloud storage and service providers, from AWS to Google to Microsoft. For data sets of a decent size, its benefits far outweigh its drawbacks.
Overall, within the public sector big data landscape, Parquet remains a very solid bet for public safety agencies that want to maximize the use of their data with limited time and money. When an agency has massive amounts of raw data, it can be hard to see the trends and insights that could help law enforcement prevent future victimization. By leveraging the Parquet format for its speed and size benefits, agencies can bring those crime data insights to light for effective change. If an agency’s CAD or RMS archives hold a couple million records or more, dropping from gigabytes to mere megabytes of search-optimized records can save a significant amount of time and money.