How To: Exploiting Big Data with Indexes (2024)

Use Case: In today's business environment, more than ever, it's simply not good enough to be average. Organizations of all sizes must strive to create competitive advantages, understand trends and gain better insight into operational efficiency. One of the most useful techniques for accomplishing these goals is exploiting Big Data through analysis. However, this is challenging due to the volume, velocity and variety of the content that must be analyzed. Image-only files are useless in data analysis, so the all-important first step in exploiting all of your content is to apply indexes so that computer systems can begin to properly understand the information.

  1. Reporting: Business executives are paid good money to make important decisions about the business, and these decisions are often based on reports. These reports are typically compiled from various data sources such as spreadsheets, interviews with customers or employees and possibly other documents. Gathering all this data is not only time-consuming but also problematic because the data is often presented in an inconsistent manner. For this reason you will want to use a Big Data system such as Splunk, where business executives have instant access to real-time data sets from various sources, presented through dashboards or graphics that clearly show trends or other information pertinent to the decision-making process.
  2. Predictive analytics: Historical reporting is fantastic for analyzing information, yet that information is, by definition, in the past. Imagine if you could proactively spot a trend or predict future events with solid data. This is a major benefit of Big Data aggregation. For example, given the right set of data you can probably predict whether mortgage interest rates will increase or decrease in a particular geography, using statistics such as the current available housing inventory, real-time unemployment rates and possibly the latest transactions within a certain time period. The same Big Data aggregation concept applies to a completely different field: healthcare. If you can feed enough index information into a Big Data solution, healthcare providers can narrow down the proper diagnosis much more quickly, which can enrich people's lives.
  3. Business process improvement: There is always room for improvement, especially in the business world, and the most effective way to drive positive improvement is through visibility into the business processes themselves. Once you understand a process, you can apply metrics to it, such as the time needed to complete a task or the steps needed to finish a project. A Big Data solution such as Splunk is an ideal complement to efficiency-improving technologies such as ABBYY Data Capture, which delivers tangible return on investment through reduced labor costs associated with manual data entry, and Box, which enables highly effective collaboration so enterprise workers can get work done quickly and be more effective overall in their business activities. Deploying a Big Data analysis system alongside efficient Data Capture and secure mobile collaboration is one clear way to achieve better process improvement, but imagine all the other possibilities the data itself opens up. And it all starts by exploiting Big Data with indexes.
Features:
  • Automatic indexing of relevant data
  • Full-page recognition for a complete index
  • Touch indexing for structured data extraction

Benefits:
  • Reduces costs associated with manual data entry
  • Ability to analyze all data sets
  • Ease of use for high user adoption

Solution Description: This solution might sound grand and complicated, but it's actually straightforward and logical. There are three basic concepts: Index Creation (ABBYY technology), Index Analysis (Splunk) and secure Image Storage (Box). We will use several technologies to create indexes for various purposes, and then we will feed all these indexes to our Big Data system so that the software can do what it does best. The Big Data system allows administrators to easily aggregate all this data and then create dashboards, reports and other useful business intelligence tools. So the process is quite logical: capture indexes from all sources, including existing databases, paper documents and, of course, images, and send all these indexes to the Big Data system. Then send the images to Box for safe storage, easy access and effective collaboration.
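The three-stage flow described above can be sketched in a few lines of Python. This is a minimal, self-contained illustration: the function names and field names are hypothetical stand-ins, and a real deployment would use the ABBYY products' export options, Splunk's data inputs and the Box API rather than in-memory lists and dictionaries.

```python
import json

# Stage 1 - Index Creation: normalize extracted fields into one record.
# The field names here are illustrative, not an actual ABBYY export schema.
def capture_index(doc_id, source, fields):
    return {"doc_id": doc_id, "source": source, "fields": fields}

# Stage 2 - Index Analysis: forward the record to the aggregator.
# A real deployment would send JSON to a Splunk data input; 'sink' is a list.
def send_to_big_data(record, sink):
    sink.append(json.dumps(record))

# Stage 3 - Image Storage: keep the image itself out of the analytics path
# (in practice, upload it to Box) and retain only a pointer to it.
def store_image(doc_id, image_bytes, storage):
    storage[doc_id] = image_bytes
    return "box://" + doc_id

splunk_sink, box_storage = [], {}
record = capture_index("inv-001", "FlexiCapture",
                       {"vendor": "Acme", "total": "142.50"})
send_to_big_data(record, splunk_sink)
pointer = store_image("inv-001", b"<image bytes>", box_storage)
print(pointer)  # box://inv-001
```

The key design point is that only the small, structured index records travel to the analytics layer, while the bulky images go to storage with just a pointer left behind.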

System Requirements:

Note: This is a software developer and systems integrator solution. We are using Splunk as our Big Data aggregator in this solution because it is easy to configure, yet extremely effective. Splunk can only perform well when you can provide it with plenty of "Index" information. As seen in this graphic, "Index" is at the core of what Big Data needs to even begin analyzing different data sets.

  1. Box account
  2. ABBYY FlexiCapture for Automatic Data Capture
  3. ABBYY Recognition Server for Full-Page recognition
  4. ABBYY TouchTo for touch indexing
  5. Splunk Big Data software (free download)

Configuration Steps (Complexity = Moderate to Involved):

  1. Start Splunk and choose Add data
  2. Depending on the output type and format of your indexes, select the proper Splunk Add Data function
  3. Now connect Splunk to your data source(s)
    1. For example, for Recognition Server you might choose 'From files or directories' and, as an option, Preview data before indexing
    2. …for FlexiCapture you might choose 'any other data…' and then 'Consume data from databases', because it outputs directly to a SQL database
    3. …and for TouchTo you might choose 'a file or directory of files'
  4. After connecting all the index data sources to Splunk, it is advisable to review the Splunk Manager options to familiarize yourself with the various settings and configurations available
  5. Now that you have configured Splunk to use indexes from your various Data Capture and Conversion sources, you will want to gather the information contained within Box. To do this, a software developer would use the Box API (Application Programming Interface) to import data via calls such as tags, get comments or get file info
  6. A complete list of all the Splunk Indexes can be viewed in Manager
  7. Once all the indexes have been aggregated within Splunk, organizations can truly realize the benefits of Big Data with detailed reporting, predictive analytics and/or improved business processes via simple visual tools such as dashboards
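Conceptually, what the aggregator does after the steps above is pull records from heterogeneous sources into one searchable pool with a common schema. The sketch below imitates that step in plain Python; the source names match the products above, but every field name is an assumption, not the products' actual output format.

```python
# Toy aggregation: map source-specific records onto one common event schema,
# the way a Big Data system normalizes inputs before analysis.
def normalize(source, raw):
    if source == "recognition_server":   # full-page text from files/directories
        return {"source": source, "doc": raw["file"], "text": raw["text"]}
    if source == "flexicapture":         # structured rows from a SQL database
        return {"source": source, "doc": raw["DocumentId"], "text": raw["Vendor"]}
    if source == "touchto":              # touch-indexed key/value entries
        return {"source": source, "doc": raw["image"], "text": raw["value"]}
    raise ValueError("unknown source: " + source)

events = [
    normalize("recognition_server", {"file": "scan1.pdf", "text": "Invoice 42"}),
    normalize("flexicapture", {"DocumentId": "D-7", "Vendor": "Acme"}),
    normalize("touchto", {"image": "page3.tif", "value": "PO-1138"}),
]

# Group the unified events by origin, as a dashboard might.
by_source = {}
for e in events:
    by_source.setdefault(e["source"], []).append(e)
print(sorted(by_source))  # ['flexicapture', 'recognition_server', 'touchto']
```

Once every source speaks the same schema, reports and dashboards can slice across all of them at once, which is the whole point of the aggregation step.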

Associated screen prints on this solution:

1. Splunk architecture with Index at the core

2. Start Splunk

3. Add data

4. Splunk add From files or directories

5. Data preview

6. Any other data…

7. Consume data from databases

8. Splunk add A file or directory of files

9. Splunk Manager

10. Splunk Indexes Manager

11. Splunk dashboard

What do you think? "Big Data" is still a relatively new idea, and many use cases are just coming to light. How can you imagine using Big Data? The possibilities to innovate in this area are tremendous. Do you have a story to tell?

#data #tag #indexes #indexing #box #tags #metadata #ScanningandCapture #BigData #tagging



FAQs

What is big data indexing?

The idea of Big Data indexing is to fragment the datasets according to criteria that will be used frequently in queries. The fragments are indexed, with each fragment containing values satisfying certain query predicates. This stores the data in a more organized manner, thereby easing information retrieval.

What is the best way to analyze database indexes?

Use SQL tools like MySQL's EXPLAIN or Microsoft SQL Server's Query Execution Plan. These will give you a solid view of how queries are being executed and which indexes are well utilized. You can then more easily see where to add missing indexes and remove ones you no longer need.
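SQLite ships with Python, so you can watch this in action without a server. EXPLAIN QUERY PLAN is SQLite's counterpart to MySQL's EXPLAIN: before the index is created the plan reports a full scan, and afterwards it reports a lookup using the index.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, vendor TEXT)")
con.executemany("INSERT INTO docs (vendor) VALUES (?)",
                [("Acme",), ("Globex",), ("Initech",)])

def plan(sql):
    # EXPLAIN QUERY PLAN rows carry the human-readable detail in column 3.
    return " ".join(row[3] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

before = plan("SELECT * FROM docs WHERE vendor = 'Acme'")
con.execute("CREATE INDEX idx_vendor ON docs (vendor)")
after = plan("SELECT * FROM docs WHERE vendor = 'Acme'")

print(before)  # a full table scan of docs
print(after)   # a search using idx_vendor
```

The same before/after comparison is how you verify, in any engine, that an index you added is actually being used.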

How can indexes be used to optimize performance?

The main benefit of database indexes is that they can improve the performance of your queries by reducing the amount of data that the database engine has to scan, sort, or join. This can result in faster response times, lower resource consumption, and better user experience.

What are the techniques of indexing data?

Indexing is a very useful technique that helps optimize the search time of database queries. An index table consists of a search key and a pointer. There are four types of indexing: Primary, Secondary, Clustering, and Multivalued Indexing. Primary indexing is divided into two types, dense and sparse.
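The dense/sparse distinction is easy to see in code. The following is a toy sketch, not a real database engine: a dense index keeps one (key, position) entry per record, while a sparse index keeps one anchor entry per block and scans forward from the nearest anchor.

```python
from bisect import bisect_right

# Records sorted by search key, as primary indexing requires.
records = [("A001", "row1"), ("B002", "row2"), ("C003", "row3"), ("D004", "row4")]

# Dense index: one entry per record, direct lookup.
dense = {key: pos for pos, (key, _) in enumerate(records)}

# Sparse index: one anchor per block (here, every 2 records).
BLOCK = 2
sparse = [(records[i][0], i) for i in range(0, len(records), BLOCK)]
anchor_keys = [k for k, _ in sparse]

def sparse_lookup(key):
    # Find the block whose anchor key precedes (or equals) the target,
    # then scan forward within that block.
    start = sparse[bisect_right(anchor_keys, key) - 1][1]
    for pos in range(start, min(start + BLOCK, len(records))):
        if records[pos][0] == key:
            return pos
    return None

print(dense["C003"], sparse_lookup("C003"))  # 2 2
```

The trade-off is visible even in this sketch: the sparse index stores half as many entries but pays for it with a short scan inside each block.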

How is indexing done in Hadoop?

In a distributed file system like HDFS, indexing is different from that of a local file system. Here, indexing and searching of data are done using the memory of the HDFS node where the data resides. The generated index files are stored in a folder in the directory where the actual data resides.

What are the three types of indexing?

Indexing is a technique that uses data structures to optimize the searching time of a database query. An index table contains two columns, namely Search Key and Data Pointer (or Reference). There are three types of indexing, namely Ordered, Single-level, and Multi-level.

What are indexed strategies?

Indexing Strategies: Definition

Indexing is – very simply – an investment strategy, which attempts to mimic the performance of a market index. An index is a “yardstick”, and a market index is a group or “basket” or portfolio of securities selected to represent and reflect the market as a whole.

What is the potential drawback of using indexes in a database?

The first and perhaps most obvious drawback of adding indexes is that they take up additional storage space. The exact amount of space depends on the size of the table and the number of columns in the index, but it's usually a small percentage of the total size of the table.

How do indexes affect database performance?

Indexes greatly influence the efficiency of database operations. They can significantly speed up data retrieval but, on the other hand, can slow down data modification (insert, update, delete). Without an index, when a query is run the database searches through all records (a full table scan) to find the relevant rows.

Which is a powerful technique of indexing?

Hash-based indexing

Hash-based indexing is a technique that uses a hash function to map each data value to a unique hash key, which is then stored in a hash table along with a pointer to the actual data location.
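A dict in Python is itself a hash table, so a toy hash-based index takes only a few lines: map each key value to the positions of the matching records, which stand in for the "pointers to the actual data location".

```python
records = [
    {"vendor": "Acme",   "total": 100},
    {"vendor": "Globex", "total": 250},
    {"vendor": "Acme",   "total": 75},
]

# Build the hash index: key value -> list of record positions.
hash_index = {}
for pos, rec in enumerate(records):
    hash_index.setdefault(rec["vendor"], []).append(pos)

# Point lookups are O(1) on average; the classic weakness of hash
# indexes is range queries, where ordered (B-tree) indexes win.
print(hash_index["Acme"])  # [0, 2]
```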

What is the methodology for indexing?

The specific way you index depends on how the Capture administrator set up the index profile. A typical method is to type a value in each field and press the Tab or Enter key to move to the next field. After you enter a value in the last field and press Tab or Enter, the next image is displayed.

What are the basic steps of indexing?

Indexing steps
  1. Crawl all pages of the seedlist and persist them to disk.
  2. Extract the file content and persist it to disk.
  3. Crawl a seedlist page from disk.
  4. Index the seedlist entries into Lucene documents.
  5. Write the documents to the Lucene index.
  6. Repeat until all the persisted seedlist pages have been crawled.
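Steps 4 and 5 above amount to building an inverted index. The sketch below is a toy stand-in for what Lucene does internally: each term maps to the set of documents that contain it.

```python
# Crawled pages (step 3), keyed by document id.
pages = {
    "doc1": "big data needs indexes",
    "doc2": "indexes speed up search",
    "doc3": "big indexes big gains",
}

# Step 4/5: tokenize each page and write terms into the inverted index.
inverted = {}
for doc_id, text in pages.items():
    for term in set(text.lower().split()):
        inverted.setdefault(term, set()).add(doc_id)

def search(term):
    """Return the ids of every document containing the term."""
    return sorted(inverted.get(term.lower(), set()))

print(search("indexes"))  # ['doc1', 'doc2', 'doc3']
print(search("big"))      # ['doc1', 'doc3']
```

A real Lucene index adds analyzers, term positions and scoring on top, but the term-to-documents mapping is the core of it.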

What do you mean by indexing data?

An index offers an efficient way to quickly access the records from the database files stored on the disk drive. It optimizes the database querying speed by serving as an organized lookup table with pointers to the location of the requested data.

What is the purpose of indexing?

Indexing, broadly, refers to the use of some benchmark indicator or measure as a reference or yardstick. In finance and economics, indexing is used as a statistical measure for tracking economic data such as inflation, unemployment, gross domestic product (GDP) growth, productivity, and market returns.

What is indexing in BigQuery?

When you index your data, BigQuery can optimize some queries that use the SEARCH function or other functions and operators, such as =, IN, LIKE, and STARTS_WITH.

Why is indexing important in data processing?

Document indexing is a tagging and categorization process that makes it easy to locate and retrieve specific pieces of information within a given set of documents. By identifying and extracting key identifiers from within each document, indexing enables near instantaneous retrieval of any file via text-based searches.
