BetterPerformanceWithFullText: Difference between revisions

From Request Tracker Wiki
Jump to navigation Jump to search
m (20 revisions imported)
 
(No difference)

Latest revision as of 15:03, 6 April 2016


This page is relevant for version 4.0.6 with a Postgres database. It might work for other versions (older and newer) with a few tweaks. There's a short note about Oracle at the end.

Warnings

The actions below will modify all the queries involving tickets and not only the full text ones. I do believe that the modified versions will perform at least as fast but hey... If you implement this and encounter any unwanted side effects, my email is somewhere in my profile.
I use GIN indexes and not the default GIST. What follows is very likely to be relevant to both types, I only tested with GIN though.


Symptom: 

You created the full text index following the documentation, you keep it up to date, however you hear complaints about performance of the full text search, especially when no other criteria are provided.

If you log your long queries you may find many occurences of such full text searches in your logs.


Explanation:

When we execute full text searches in RT, more often than not the full text index is not used, instead, a plan starting with a full scan on Tickets is preferred. In my experience, this happens for all searches where the term length is less than 6 characters and for some longer, but more common terms. This problem might be masked by modern, powerful hardware as the one I'm lucky enough to work with. Still, I managed to bring down the response time for some searches from anything up to 2 minutes to a couple of seconds.

The query in question looks like this:

SELECT DISTINCT main.* FROM Tickets main 
JOIN Transactions Transactions_1 
  ON ( Transactions_1.ObjectType = 'RT::Ticket' )
  AND ( Transactions_1.ObjectId = main.id ) 
JOIN Attachments Attachments_2 
  ON ( Attachments_2.TransactionId = Transactions_1.id )
WHERE (main.Status != 'deleted')
AND ((( Attachments_2.textsearchable @@ plainto_tsquery('<search terms>'))))
AND (main.Type = 'ticket')
AND (main.EffectiveId = main.id) ORDER BY main.id ASC LIMIT 50;


There's also a select count... similar with the query above in order to bring the total number of hits. This one is used for pagination (and usually takes a lot more time than the original query):

SELECT COUNT(DISTINCT main.id) FROM Tickets main 
JOIN Transactions Transactions_1 
  ON (Transactions_1.ObjectType = 'RT::Ticket')
  AND ( Transactions_1.ObjectId = main.id ) 
JOIN Attachments Attachments_2 
  ON ( Attachments_2.TransactionId = Transactions_1.id )
WHERE (main.Status != 'deleted')
AND ((( Attachments_2.textsearchable @@ plainto_tsquery('<search terms>'))))
AND (main.Type = 'ticket') AND (main.EffectiveId = main.id);

The following refers to this second query, it should apply to the first one too most of the time.

Here's the plan I get by default:

    
                                                         QUERY PLAN                                                             
------------------------------------------------------------------------------------------------------------------------------------

 Aggregate  (cost=25310.73..25310.74 rows=1 width=4)
   ->  Nested Loop  (cost=0.00..25306.87 rows=1542 width=4)
         ->  Nested Loop  (cost=0.00..18317.09 rows=13932 width=8)
               ->  Seq Scan on tickets main  (cost=0.00..5256.37 rows=789 width=8)
                     Filter: (((status)::text <> 'deleted'::text) AND (id = effectiveid) AND ((type)::text = 'ticket'::text))
               ->  Index Scan using transactions1 on transactions transactions_1  (cost=0.00..16.27 rows=23 width=8)
                     Index Cond: (((transactions_1.objecttype)::text = 'RT::Ticket'::text) AND (transactions_1.objectid = main.id))
         ->  Index Scan using attachments2 on attachments attachments_2  (cost=0.00..0.49 rows=1 width=4)
               Index Cond: (attachments_2.transactionid = transactions_1.id)
               Filter: (attachments_2.textsearchable @@ plainto_tsquery('<search terms>'::text))


Solution:

Give the optimizer more information about how often "effectiveid" and "id" are equal (most of the time actually). We can do this by adding a new index on the expression (effectiveid - id). The index itself is of no use for this or any other query (the value is 0 most of the times). The usefulness comes from the fact that Postgres calculates and stores statistics on expression indexes in the same way it does on table columns.

create index tickets_merged_helper on tickets((effectiveid - id));
analyze tickets;


The story runs something like that: Postgres doesn't do correlated statistics ("how often are 2 columns equal?"). So the optimizer says to itself something like that: "Aha, effectiveid should be pretty selective (most values are distinct), so let's do first for a full scan on Tickets, I'm sure I'll end up with fewer lines in the beginning stages of my query than if I were to use the rather costly path of full text".

Do I have any evidence for that? Why, yes, glad you asked. Let's explain 2 simple queries.

First, in the original form:

explain select * from tickets main where (main.Status != 'deleted') 
AND (main.Type = 'ticket') AND (main.EffectiveId = main.id);

                                   QUERY PLAN
-------------------------------------------------------------------------------
Seq Scan on tickets main  (cost=0.00..5256.37 rows=789 width=188)
  Filter: (((status)::text <> 'deleted'::text) AND ((type)::text = 'ticket'::text) AND (effectiveid = id))

And the changed form:

explain select * from tickets main where (main.Status != 'deleted') 
AND (main.Type = 'ticket') AND (main.EffectiveId - main.id = 0);

                                  QUERY PLAN
-------------------------------------------------------------------------------
Seq Scan on tickets main  (cost=0.00..5654.14 rows=154660 width=188)
  Filter: (((status)::text <> 'deleted'::text) AND ((type)::text = 'ticket'::text) AND ((effectiveid - id) = 0))


In the first case the optimizer thinks that there are only 789 rows fitting the criteria. In the second, 154660. Let's see which one is closer to reality:

select count(*) from tickets main where (main.Status != 'deleted') AND (main.Type = 'ticket') AND (main.EffectiveId = main.id);

count  
-------
154902
(1 row)

Right, so this shows that the index is effective - the optimizer has now a better estimate of the number of rows involved. Now we have to modify Tickets.pm to build the query with the new clause. Create a file called Tickets_Local.pm under /rt/path/local/lib/RT.

Add this on the first line:

package RT::Tickets;

Copy the sub FromSQL from /opt/rt4-INSTANCE/lib/RT/Tickets_SQL.pm and paste it in Tickets_Local.pm. Find the following code:

 unless ( exists $self->{_sql_looking_at}{'effectiveid'} ) {
       $self->SUPER::Limit( FIELD           => 'EffectiveId',
                            VALUE           => 'main.id',
                            ENTRYAGGREGATOR => 'AND',
                            QUOTEVALUE      => 0,
                          );
   }

and replace it with this:

  unless ( exists $self->{_sql_looking_at}{'effectiveid'} ) {
       $self->SUPER::Limit( FUNCTION         => '(main.EffectiveId - main.id)',
                           VALUE           => '0',
                           ENTRYAGGREGATOR => 'AND',
                           QUOTEVALUE      => 0,
                          );
   }

Close it, save it and restart RT. Here's how the final version of Tickets_Local.pm should look like (as of version 4.0.6):

package RT::Tickets;

sub FromSQL {

   my ($self,$query) = @_;
   {
       # preserve first_row and show_rows across the CleanSlate
       local ($self->{'first_row'}, $self->{'show_rows'});
       $self->CleanSlate;
   }
   $self->_InitSQL();
   return (1, $self->loc("No Query")) unless $query;
   $self->{_sql_query} = $query;
   eval { $self->_parser( $query ); };
   if ( $@ ) {
       $RT::Logger->error( $@ );
       return (0, $@);
   }
   # We only want to look at EffectiveId's (mostly) for these searches.
   unless ( exists $self->{_sql_looking_at}{'effectiveid'} ) {
       $self->SUPER::Limit( FUNCTION         => '(main.EffectiveId - main.id)',
                            VALUE           => '0',
                            ENTRYAGGREGATOR => 'AND',
                            QUOTEVALUE      => 0,
                          );
   }
   # Unless we've explicitly asked to look at a specific Type, we need
   # to limit to it.
   unless ( $self->{looking_at_type} ) {
       $self->SUPER::Limit( FIELD => 'Type', VALUE => 'ticket' );
   }
   # We don't want deleted tickets unless 'allow_deleted_search' is set
   unless( $self->{'allow_deleted_search'} ) {
       $self->SUPER::Limit( FIELD    => 'Status',
                            OPERATOR => '!=',
                            VALUE => 'deleted',
                          );
   }
   # set SB's dirty flag
   $self->{'must_redo_search'} = 1;
   $self->{'RecalcTicketLimits'} = 0;
   return (1, $self->loc("Valid Query"));
}

1;

The new count query will now look like that:

SELECT COUNT(DISTINCT main.id) FROM Tickets main 
JOIN Transactions Transactions_1 
  ON ( Transactions_1.ObjectType = 'RT::Ticket' )
  AND ( Transactions_1.ObjectId = main.id ) 
JOIN Attachments Attachments_2 
  ON ( Attachments_2.TransactionId = Transactions_1.id )
WHERE (main.Status != 'deleted')
AND ((( Attachments_2.textsearchable @@ plainto_tsquery('<search terms>'))))
AND (main.Type = 'ticket') 
AND ((main.EffectiveId - main.id = 0));


The new plan will vary according to other settings in the (and the size of the) database, as well as depending on the criteria in the search. The main improvement is that you will see much more often a scan on the full text index in the begining stages of the query. Here's a fairly likely one:

    
                                                                    QUERY PLAN                                                                    
--------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=73802.44..73802.45 rows=1 width=4)
   ->  Nested Loop  (cost=5061.52..73751.79 rows=20260 width=4)
         ->  Nested Loop  (cost=5061.52..67403.95 rows=20842 width=4)
               ->  Bitmap Heap Scan on attachments attachments_2  (cost=5061.52..26991.85 rows=26165 width=4)
                     Recheck Cond: (textsearchable @@ plainto_tsquery('<search terms>'::text))
                     ->  Bitmap Index Scan on attachments_textsearch  (cost=0.00..5054.98 rows=26165 width=0)
                           Index Cond: (textsearchable @@ plainto_tsquery('<search terms>'::text))
               ->  Index Scan using transactions_pkey on transactions transactions_1  (cost=0.00..1.53 rows=1 width=8)
                     Index Cond: (transactions_1.id = attachments_2.transactionid)
                     Filter: ((transactions_1.objecttype)::text = 'RT::Ticket'::text)
         ->  Index Scan using tickets5 on tickets main  (cost=0.00..0.29 rows=1 width=4)
               Index Cond: (main.id = transactions_1.objectid)
               Filter: (((main.status)::text <> 'deleted'::text) AND ((main.type)::text = 'ticket'::text) AND ((main.effectiveid - main.id) = 0))


Note for Oracle users.

If you use 11g or later you might be able to ignore all of the above. Lookup "extended statistics" in the Oracle docs. You should be able to run something like that:

DBMS_STATS.CREATE_EXTENDED_STATS('YOUR_SCHEMA', 'TICKETS', 'EFFECTIVEID, ID');
DBMS_STATS.GATHER_TABLE_STATS('YOUR_SCHEMA', 'TICKETS' ....);

and hope for the best. I didn't test it.

If you use 10g, or if the 11g fix doesn't work, you should create a function index on (effectiveid, id), like above. Then look up the option 'FOR ALL HIDDEN COLUMNS' in the DBMS_STATS documentation.